This bike-sharing dataset (hour.csv) was obtained from the UCI Machine Learning Repository. The description below is extracted and adapted from the included “Readme.txt”:
Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems because of their important role in traffic, environmental, and health issues.
The bike-sharing rental process is highly correlated with environmental and seasonal settings. For instance, weather conditions, precipitation, day of the week, season, and hour of the day can all affect rental behavior. The core dataset is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system in Washington, D.C.
hour.csv
- instant: record index
- dteday : date
- season : season (1: spring, 2: summer, 3: fall, 4: winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit : weather situation
  - 1: Clear, Few clouds, Partly cloudy
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of bikes rented by casual users
- registered: count of bikes rented by registered users
- cnt: total count of rented bikes, casual plus registered
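As a small illustration of the scaling described in the list above, the sketch below multiplies the normalized columns by the stated maxima (41, 50, 100, 67) to recover values in the original units. The helper name `denormalize`, the new column names, and the toy data frame are assumptions made for this example only; they are not part of the dataset or the original analysis.

```r
# Hypothetical helper: undo the "divided by max" scaling described above.
# The divisors 41, 50, 100 and 67 come from the variable list.
denormalize <- function(df) {
  transform(
    df,
    temp_c   = temp * 41,       # temperature in Celsius
    atemp_c  = atemp * 50,      # feeling temperature in Celsius
    hum_pct  = hum * 100,       # humidity in percent
    wind_raw = windspeed * 67   # wind speed in its original units
  )
}

# Toy rows using values of the kind found in hour.csv
toy <- data.frame(
  temp      = c(0.24, 0.24),
  atemp     = c(0.2879, 0.2576),
  hum       = c(0.81, 0.75),
  windspeed = c(0, 0.0896)
)

toy_denorm <- denormalize(toy)
print(toy_denorm)
```

The models in this document work on the normalized columns directly; converting back is mainly useful when reading plots or summaries in familiar units.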
Task: predict the hourly bike rental count from the environmental and seasonal settings.
Let’s read the input and take a look at its structure.
# Read
hour.data <- read.csv("hour.csv", header = TRUE, stringsAsFactors = FALSE)
# Overview
str(hour.data)
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
head(hour.data)
## instant dteday season yr mnth hr holiday weekday workingday
## 1 1 2011-01-01 1 0 1 0 0 6 0
## 2 2 2011-01-01 1 0 1 1 0 6 0
## 3 3 2011-01-01 1 0 1 2 0 6 0
## 4 4 2011-01-01 1 0 1 3 0 6 0
## 5 5 2011-01-01 1 0 1 4 0 6 0
## 6 6 2011-01-01 1 0 1 5 0 6 0
## weathersit temp atemp hum windspeed casual registered cnt
## 1 1 0.24 0.2879 0.81 0.0000 3 13 16
## 2 1 0.22 0.2727 0.80 0.0000 8 32 40
## 3 1 0.22 0.2727 0.80 0.0000 5 27 32
## 4 1 0.24 0.2879 0.75 0.0000 3 10 13
## 5 1 0.24 0.2879 0.75 0.0000 0 1 1
## 6 2 0.24 0.2576 0.75 0.0896 0 1 1
Split the data into training and testing sets: days 1 to 21 of each month go to training, and the remaining days go to testing.
train <- hour.data[as.integer(substr(hour.data$dteday,9,10)) < 22, ]
test <- hour.data[as.integer(substr(hour.data$dteday,9,10)) > 21, ]
# Training: 69.2%
nrow(train)/ nrow(hour.data)
## [1] 0.6924449
# Testing: 30.8%
nrow(test)/ nrow(hour.data)
## [1] 0.3075551
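The day-of-month split can also be written with a single logical mask, which makes it explicit that the two sets are complementary. The following is a minimal sketch on a toy data frame (one row per day over two months, not the real hour.csv); the names `toy`, `day_of_month`, and `is_train` are illustrative assumptions.

```r
# Toy stand-in for hour.data: one row per day, January and February 2011
toy <- data.frame(
  dteday = as.character(seq(as.Date("2011-01-01"),
                            as.Date("2011-02-28"), by = "day")),
  stringsAsFactors = FALSE
)

# Extract the day of the month from the date string
day_of_month <- as.integer(format(as.Date(toy$dteday), "%d"))

# Days 1 to 21 go to training; the rest go to testing
is_train  <- day_of_month <= 21
toy_train <- toy[is_train, , drop = FALSE]
toy_test  <- toy[!is_train, , drop = FALSE]

c(train = nrow(toy_train), test = nrow(toy_test))
```

Because the test set is defined as the negation of the training mask, no row can fall into both sets or into neither.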
GOAL: Apply different models to predict the hourly total rental count.
Technical explanation: Each model is fit on the training dataset, and its predictions are compared against the actual values in the testing dataset. The measure used here is mean((y - yhat)^2), i.e. Mean Squared Error (MSE). We apply multiple models and look for the one that minimizes MSE. The total rental count (cnt) is the sum of rentals by registered users (registered) and casual users (casual). After trying different combinations, we found that using two separate models to predict registered and casual rentals yields a better result than using a single model to predict total rentals (cnt).
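To make the evaluation concrete, here is a minimal sketch of the MSE computation and of how a combined prediction is formed by adding the registered and casual predictions. The helper `mse` and all predicted values are made up for illustration; they are not results from the models in this document.

```r
# Mean Squared Error: mean((y - yhat)^2)
mse <- function(y, yhat) mean((y - yhat)^2)

# Toy actual values; cnt is the sum of registered and casual
actual <- data.frame(registered = c(13, 32, 27), casual = c(3, 8, 5))
actual$cnt <- actual$registered + actual$casual

# Invented predictions from two hypothetical separate models
pred_registered <- c(12, 30, 29)
pred_casual     <- c(4, 8, 3)

# Combined prediction for cnt: sum of the two separate predictions
pred_combined <- pred_registered + pred_casual

registered_mse <- mse(actual$registered, pred_registered)
casual_mse     <- mse(actual$casual, pred_casual)
combined_mse   <- mse(actual$cnt, pred_combined)

c(registered = registered_mse, casual = casual_mse, combined = combined_mse)
```

In this toy case the errors of the two separate predictions partly cancel when summed, so the combined MSE is lower than the registered MSE alone. Whether that happens on real data depends on the models.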
The following are the models we applied:
The results for each model are reported as “cnt.MSE”, “combined.MSE”, “registered.MSE”, and “casual.MSE”.
# Load packages
library(nnet)
library(ggplot2)
library(ggthemes)
library(gbm)
library(randomForest)
library(e1071)
library(rpart)
First, get the data ready. We leave the numeric variables as they are for the neural networks, before factorizing some of the attributes.
Step 1: Original model
# Original model: predict total rentals (cnt) from all predictors
neural.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed
neural.model = nnet(neural.formula, train, size = 20, maxit = 5000, linout = TRUE, decay = 0.01)
## # weights: 281
## initial value 832864009.618527
## iter 10 value 401885763.140496
## iter 20 value 348978754.378820
## iter 30 value 283838842.601240
## iter 40 value 279704475.218207
## iter 50 value 272687834.520042
## iter 60 value 258199071.170540
## iter 70 value 256275478.213297
## iter 80 value 243343886.713717
## iter 90 value 235322648.202721
## iter 100 value 233331727.993393
## iter 110 value 231727244.639550
## iter 120 value 225599415.914042
## iter 130 value 223289567.934950
## iter 140 value 221943319.078645
## iter 150 value 220557329.620354
## iter 160 value 220016346.942685
## iter 170 value 217213932.679547
## iter 180 value 216560694.258283
## iter 190 value 215558041.452651
## iter 200 value 213938999.014902
## iter 210 value 212820516.089442
## iter 220 value 210136264.007181
## iter 230 value 206102474.622544
## iter 240 value 205758459.787147
## iter 250 value 204431080.291385
## iter 260 value 202766208.932347
## iter 270 value 202076288.588333
## iter 280 value 201093977.567429
## iter 290 value 200667884.474503
## iter 300 value 199951095.739293
## iter 310 value 199426226.617432
## iter 320 value 198973143.088701
## iter 330 value 198289191.954146
## iter 340 value 195195864.541958
## iter 350 value 194522344.330864
## iter 360 value 194233159.991431
## iter 370 value 194020817.358865
## iter 380 value 193744875.825689
## iter 390 value 193613581.064182
## iter 400 value 193507008.131858
## iter 410 value 193212104.989112
## iter 420 value 193146714.611361
## iter 430 value 193041644.596600
## iter 440 value 192419689.254587
## iter 450 value 189402313.985063
## iter 460 value 187464739.089603
## iter 470 value 186850851.856394
## iter 480 value 186426292.245397
## iter 490 value 185347069.865467
## iter 500 value 185095904.084422
## iter 510 value 184918778.173578
## iter 520 value 184598360.161520
## iter 530 value 183801695.239463
## iter 540 value 183260818.813020
## iter 550 value 182766495.644261
## iter 560 value 182102017.387564
## iter 570 value 181748920.541705
## iter 580 value 181170863.952553
## iter 590 value 180574396.914170
## iter 600 value 180443288.568554
## iter 610 value 179357458.281890
## iter 620 value 178895627.321713
## iter 630 value 178608862.929772
## iter 640 value 178314365.405016
## iter 650 value 177373142.084943
## iter 660 value 177114001.979165
## iter 670 value 176539718.654416
## iter 680 value 175449727.670270
## iter 690 value 174994938.239167
## iter 700 value 174281063.378202
## iter 710 value 173231381.795726
## iter 720 value 170926867.407710
## iter 730 value 167644917.758654
## iter 740 value 166173089.236732
## iter 750 value 164797757.685431
## iter 760 value 164269185.172631
## iter 770 value 164174656.338741
## iter 780 value 163776497.285036
## iter 790 value 163186004.360750
## iter 800 value 162577528.947126
## iter 810 value 160953879.121289
## iter 820 value 160080875.285189
## iter 830 value 159070822.746583
## iter 840 value 158619092.588017
## iter 850 value 157722078.940917
## iter 860 value 157082571.804362
## iter 870 value 156410785.343051
## iter 880 value 155822574.044754
## iter 890 value 155619124.659299
## iter 900 value 155010728.721440
## iter 910 value 152345391.169099
## iter 920 value 150658174.793157
## iter 930 value 149832663.008073
## iter 940 value 148800574.883277
## iter 950 value 148448315.657276
## iter 960 value 147992464.911566
## iter 970 value 147746853.460349
## iter 980 value 147456629.392884
## iter 990 value 146857856.251058
## iter1000 value 145985501.219347
## iter1010 value 145417449.570299
## iter1020 value 144951764.465454
## iter1030 value 144750159.816561
## iter1040 value 144478285.585400
## iter1050 value 143674021.339089
## iter1060 value 143238846.447221
## iter1070 value 143009921.090186
## iter1080 value 142712813.649086
## iter1090 value 142613454.724923
## iter1100 value 142506053.820623
## iter1110 value 142433079.316332
## iter1120 value 142287834.081657
## iter1130 value 142071154.609680
## iter1140 value 141410306.816499
## iter1150 value 140829573.739404
## iter1160 value 140503734.565503
## iter1170 value 140361951.887020
## iter1180 value 140360057.755651
## iter1190 value 140356917.507246
## iter1200 value 140352733.973246
## iter1210 value 140349182.850679
## iter1220 value 140342667.931720
## iter1230 value 140315415.971716
## iter1240 value 140298176.299817
## iter1250 value 140269682.941451
## iter1260 value 140257838.802483
## iter1270 value 140241839.011219
## iter1280 value 140236931.028976
## iter1290 value 140233498.892464
## iter1300 value 140217764.052089
## iter1310 value 140209237.797160
## iter1320 value 140206761.340625
## iter1330 value 140198362.231654
## iter1340 value 140193584.109016
## iter1350 value 140187947.647596
## iter1360 value 140145685.010396
## iter1370 value 140078963.497798
## iter1380 value 140007183.508499
## iter1390 value 139962346.332180
## iter1400 value 139848384.007793
## iter1410 value 139793473.225434
## iter1420 value 139775220.188604
## iter1430 value 139733087.589917
## iter1440 value 139668965.210955
## iter1450 value 139538907.473781
## iter1460 value 139263862.836763
## iter1470 value 138913104.915816
## iter1480 value 138551011.261601
## iter1490 value 138201623.150725
## iter1500 value 137961884.771817
## iter1510 value 137423125.143519
## iter1520 value 137053700.173198
## iter1530 value 136792040.824953
## iter1540 value 136647222.812530
## iter1550 value 136457468.851357
## iter1560 value 136214535.404527
## iter1570 value 136105654.805139
## iter1580 value 136028183.824054
## iter1590 value 135941599.254453
## iter1600 value 135875421.337441
## iter1610 value 135843653.936822
## iter1620 value 135823828.205787
## iter1630 value 135755205.957870
## iter1640 value 135677418.608513
## iter1650 value 135556945.397105
## iter1660 value 135525881.861988
## iter1670 value 135479359.102094
## iter1680 value 135442002.268532
## iter1690 value 135404559.271277
## iter1700 value 135355071.531306
## iter1710 value 135306641.138099
## iter1720 value 135282310.716284
## iter1730 value 135260637.637358
## iter1740 value 135244842.780120
## iter1750 value 135203560.791885
## iter1760 value 135140072.245297
## iter1770 value 135101389.721845
## iter1780 value 135060575.892804
## iter1790 value 135019505.899904
## iter1800 value 134953680.490498
## iter1810 value 134925945.372750
## iter1820 value 134907925.709135
## iter1830 value 134898873.100306
## iter1840 value 134890823.455050
## iter1850 value 134867385.077882
## iter1860 value 134858584.495069
## iter1870 value 134849111.911654
## iter1880 value 134833801.007796
## iter1890 value 134802970.485236
## iter1900 value 134769999.492702
## iter1910 value 134662842.557927
## iter1920 value 134503320.037535
## iter1930 value 134352982.749919
## iter1940 value 134206612.273081
## iter1950 value 134043812.008293
## iter1960 value 133906790.278007
## iter1970 value 133770980.579687
## iter1980 value 133614632.261176
## iter1990 value 133424227.116161
## iter2000 value 133036524.485278
## iter2010 value 132672364.743171
## iter2020 value 131999467.059997
## iter2030 value 131033140.520079
## iter2040 value 130136690.386980
## iter2050 value 129328093.615910
## iter2060 value 128319051.198663
## iter2070 value 127531419.265729
## iter2080 value 126809004.747040
## iter2090 value 126273914.228043
## iter2100 value 125719602.555520
## iter2110 value 125235491.070953
## iter2120 value 124755489.351092
## iter2130 value 124019260.511297
## iter2140 value 123237742.484777
## iter2150 value 122565265.949055
## iter2160 value 121167363.842853
## iter2170 value 120135924.294864
## iter2180 value 118187137.757227
## iter2190 value 116795086.625470
## iter2200 value 114860023.305526
## iter2210 value 113715631.227274
## iter2220 value 112541771.605901
## iter2230 value 111881079.403062
## iter2240 value 111519893.653368
## iter2250 value 111105963.134541
## iter2260 value 110751570.141549
## iter2270 value 110384535.772715
## iter2280 value 109846009.217578
## iter2290 value 109222486.931579
## iter2300 value 108311333.334972
## iter2310 value 107668722.563999
## iter2320 value 106916152.291830
## iter2330 value 106268170.934974
## iter2340 value 105869098.495028
## iter2350 value 105724533.851466
## iter2360 value 105574481.157928
## iter2370 value 105326268.424668
## iter2380 value 105105862.460823
## iter2390 value 104841840.277096
## iter2400 value 104589077.083883
## iter2410 value 104412637.676957
## iter2420 value 104312591.577398
## iter2430 value 104242175.112225
## iter2440 value 104132198.727228
## iter2450 value 103987226.806518
## iter2460 value 103871011.222898
## iter2470 value 103667949.486414
## iter2480 value 103496753.046199
## iter2490 value 103249894.947678
## iter2500 value 103050737.378224
## iter2510 value 102818366.676224
## iter2520 value 102694703.071025
## iter2530 value 102598067.715097
## iter2540 value 102498507.633228
## iter2550 value 102416613.677689
## iter2560 value 102205503.967940
## iter2570 value 102077570.284993
## iter2580 value 101943186.663412
## iter2590 value 101808578.900904
## iter2600 value 101730222.697376
## iter2610 value 101686510.673185
## iter2620 value 101623588.182138
## iter2630 value 101509024.603968
## iter2640 value 101383850.517000
## iter2650 value 101252575.331983
## iter2660 value 101144652.555557
## iter2670 value 101050379.841317
## iter2680 value 100970168.238717
## iter2690 value 100816070.250370
## iter2700 value 100592107.623472
## iter2710 value 100362915.142496
## iter2720 value 100119872.997113
## iter2730 value 99829769.591708
## iter2740 value 99524330.520948
## iter2750 value 99285239.454003
## iter2760 value 99084400.044720
## iter2770 value 98631959.708164
## iter2780 value 98172410.544738
## iter2790 value 97620848.769242
## iter2800 value 97019844.224685
## iter2810 value 95936875.927412
## iter2820 value 95162460.817162
## iter2830 value 94684570.487954
## iter2840 value 94341586.395075
## iter2850 value 93547252.642290
## iter2860 value 92732895.879763
## iter2870 value 91150918.202953
## iter2880 value 90082804.362325
## iter2890 value 89348324.133822
## iter2900 value 89230446.258464
## iter2910 value 89149871.002549
## iter2920 value 89063134.829888
## iter2930 value 88863943.024427
## iter2940 value 88699553.381882
## iter2950 value 88534217.423704
## iter2960 value 88348440.896396
## iter2970 value 88246585.033001
## iter2980 value 88105752.822829
## iter2990 value 87951205.743158
## iter3000 value 87679477.656151
## iter3010 value 87520974.331242
## iter3020 value 87353890.652643
## iter3030 value 87174755.462827
## iter3040 value 87018572.871537
## iter3050 value 86790513.662254
## iter3060 value 86609220.107067
## iter3070 value 86480541.821860
## iter3080 value 86405319.708296
## iter3090 value 86346111.640003
## iter3100 value 86245547.663425
## iter3110 value 86083669.717867
## iter3120 value 85913068.788343
## iter3130 value 85719633.334796
## iter3140 value 85501589.411787
## iter3150 value 85056085.719742
## iter3160 value 84690018.448498
## iter3170 value 84358168.792454
## iter3180 value 83695654.799645
## iter3190 value 82201463.624141
## iter3200 value 80210449.256535
## iter3210 value 77281453.342227
## iter3220 value 71940969.624103
## iter3230 value 66862741.391722
## iter3240 value 58952532.971234
## iter3250 value 52403859.119344
## iter3260 value 50392785.057772
## iter3270 value 49205386.139975
## iter3280 value 48119962.682653
## iter3290 value 48014920.363831
## iter3300 value 47801842.214646
## iter3310 value 47627682.296674
## iter3320 value 47334459.417899
## iter3330 value 46860269.488157
## iter3340 value 46671526.420674
## iter3350 value 46570409.816785
## iter3360 value 46509335.792177
## iter3370 value 46412459.710476
## iter3380 value 46277453.749293
## iter3390 value 46166874.600040
## iter3400 value 45908182.440148
## iter3410 value 45666669.463691
## iter3420 value 45448015.224621
## iter3430 value 45189433.915287
## iter3440 value 44812108.012661
## iter3450 value 44491263.590546
## iter3460 value 44126947.140265
## iter3470 value 43785227.838795
## iter3480 value 43600527.571279
## iter3490 value 43304554.045295
## iter3500 value 43207150.360274
## iter3510 value 43123426.257024
## iter3520 value 43049502.115530
## iter3530 value 42967912.894376
## iter3540 value 42916106.809391
## iter3550 value 42873876.648944
## iter3560 value 42833648.807118
## iter3570 value 42782826.741581
## iter3580 value 42752689.351083
## iter3590 value 42707623.227127
## iter3600 value 42671474.813470
## iter3610 value 42629434.751530
## iter3620 value 42583094.180221
## iter3630 value 42535937.007251
## iter3640 value 42425747.447780
## iter3650 value 42320001.203367
## iter3660 value 42157460.466764
## iter3670 value 42071330.305366
## iter3680 value 42003015.985785
## iter3690 value 41934324.690591
## iter3700 value 41857025.685952
## iter3710 value 41744704.357937
## iter3720 value 41634807.993501
## iter3730 value 41521342.564599
## iter3740 value 41300779.286309
## iter3750 value 41186445.779121
## iter3760 value 41157086.537967
## iter3770 value 41117390.511119
## iter3780 value 41091381.793835
## iter3790 value 41051336.890762
## iter3800 value 41007579.037139
## iter3810 value 40968545.291314
## iter3820 value 40936319.898544
## iter3830 value 40920824.105293
## iter3840 value 40908722.948887
## iter3850 value 40898366.088890
## iter3860 value 40886624.607109
## iter3870 value 40869096.247405
## iter3880 value 40847176.066384
## iter3890 value 40832919.569154
## iter3900 value 40798412.783543
## iter3910 value 40760536.825810
## iter3920 value 40734547.308361
## iter3930 value 40716583.193658
## iter3940 value 40699799.105852
## iter3950 value 40688901.488405
## iter3960 value 40668574.772251
## iter3970 value 40664249.828013
## iter3980 value 40661378.940314
## iter3990 value 40658719.501348
## iter4000 value 40651929.515931
## iter4010 value 40643137.825240
## iter4020 value 40639239.982397
## iter4030 value 40622285.060594
## iter4040 value 40607816.289442
## iter4050 value 40563976.850564
## iter4060 value 40549022.548591
## iter4070 value 40534337.360902
## iter4080 value 40523953.217672
## iter4090 value 40511976.838580
## iter4100 value 40438773.129380
## iter4110 value 40388547.259580
## iter4120 value 40258235.678470
## iter4130 value 40104017.909276
## iter4140 value 40078092.409546
## iter4150 value 40042767.465935
## iter4160 value 40018378.825907
## iter4170 value 40006039.095887
## iter4180 value 39994625.175644
## iter4190 value 39978793.361146
## iter4200 value 39953685.462221
## iter4210 value 39923401.221300
## iter4220 value 39818158.189066
## iter4230 value 39731022.720401
## iter4240 value 39600079.654829
## iter4250 value 39418376.500320
## iter4260 value 39257446.058654
## iter4270 value 39101129.105406
## iter4280 value 38965174.792821
## iter4290 value 38826225.665225
## iter4300 value 38717336.243305
## iter4310 value 38619659.093649
## iter4320 value 38502651.215678
## iter4330 value 38377417.931270
## iter4340 value 38275532.777715
## iter4350 value 38140431.616971
## iter4360 value 37887902.065785
## iter4370 value 37559784.441424
## iter4380 value 37551056.347525
## iter4390 value 37537750.984021
## iter4400 value 37502171.909502
## iter4410 value 37443161.258541
## iter4420 value 37407000.099015
## iter4430 value 37381945.533374
## iter4440 value 37334221.344278
## iter4450 value 37289378.655420
## iter4460 value 37238552.656282
## iter4470 value 37215737.299979
## iter4480 value 37169730.613472
## iter4490 value 37108936.069496
## iter4500 value 37037177.987864
## iter4510 value 36984734.225380
## iter4520 value 36943032.875736
## iter4530 value 36835552.019373
## iter4540 value 36724094.132659
## iter4550 value 36643474.349354
## iter4560 value 36555214.887095
## iter4570 value 36494923.451563
## iter4580 value 36430144.620388
## iter4590 value 36373303.451998
## iter4600 value 36292317.709649
## iter4610 value 36226401.087591
## iter4620 value 36190665.010978
## iter4630 value 36135056.222518
## iter4640 value 36006938.096010
## iter4650 value 35945019.242678
## iter4660 value 35917414.343570
## iter4670 value 35899944.134005
## iter4680 value 35878392.391346
## iter4690 value 35851089.828150
## iter4700 value 35780496.629362
## iter4710 value 35679583.168144
## iter4720 value 35621059.530129
## iter4730 value 35575247.092409
## iter4740 value 35574064.435525
## iter4750 value 35572406.201197
## iter4760 value 35570825.181430
## iter4770 value 35567228.064440
## iter4780 value 35561470.239392
## iter4790 value 35558385.035623
## iter4800 value 35556643.138034
## iter4810 value 35552299.825962
## iter4820 value 35550679.870202
## iter4830 value 35549224.355189
## iter4840 value 35548005.242872
## iter4850 value 35546428.143204
## iter4860 value 35545817.598336
## iter4870 value 35542856.714897
## iter4880 value 35540119.317696
## iter4890 value 35537056.283580
## iter4900 value 35535371.889754
## iter4910 value 35533808.109805
## iter4920 value 35531825.289901
## iter4930 value 35529904.915196
## iter4940 value 35528203.923068
## iter4950 value 35527718.446946
## iter4960 value 35527101.438905
## iter4970 value 35526559.665075
## iter4980 value 35526074.399559
## iter4990 value 35525452.372632
## iter5000 value 35524796.123247
## final value 35524796.123247
## stopped after 5000 iterations
# Keep only the predictor columns of the test set
testset = subset(test, select = c("season", "yr", "mnth", "hr", "holiday", "weekday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
# Predict total rentals on the test set and store the predictions
neural.cnt = predict(neural.model, testset, type = "raw")
test$neural.cnt = neural.cnt
# Compute MSE
neural.MSE = sum((test$cnt - test$neural.cnt)^2)/nrow(test)
neural.MSE
## [1] 4692.282
# Plot to check result
neural.result = ggplot(test,aes(cnt,neural.cnt))+geom_point()
neural.result
# Set negative predictions to zero (rental counts cannot be negative)
test$neural.cnt[test$neural.cnt < 0] = 0
# Compute new MSE
neural.MSE = sum((test$cnt - test$neural.cnt)^2)/nrow(test)
neural.MSE
## [1] 4518.081
neural.result = ggplot(test,aes(cnt,neural.cnt))+geom_point()
neural.result
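The zero-clamping step above can be illustrated in isolation. The sketch below uses made-up predictions and actual counts (not model output) and `pmax` from base R, which is equivalent to the indexed assignment used in the analysis.

```r
# Made-up predictions and actual counts for illustration only
pred   <- c(-12.5, 0, 3.2, 40.7)
actual <- c(1, 0, 3, 40)

# Replace negative predictions with zero
clamped <- pmax(pred, 0)

mse_before <- mean((actual - pred)^2)
mse_after  <- mean((actual - clamped)^2)

c(before = mse_before, after = mse_after)
```

Since actual counts are never negative, moving a negative prediction up to zero always brings it closer to the true value, so this step cannot increase the MSE.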
Step 2: Separate models for registered and casual users
# Model for Registered
# Dropping windspeed and holiday yields the best result
neural.registered.formula = registered ~ season + yr + mnth + hr + weekday + workingday + weathersit + temp + atemp + hum
neural.model.registered = nnet(neural.registered.formula, train, size = 20, maxit = 5000, linout = TRUE, decay = 0.01)
## # weights: 241
## initial value 568686130.343306
## iter 10 value 224799565.275934
## iter 20 value 194194898.396173
## iter 30 value 170533844.130250
## iter 40 value 160711617.275227
## iter 50 value 150633279.264054
## iter 60 value 140505887.758522
## iter 70 value 136342363.991998
## iter 80 value 131076423.221615
## iter 90 value 128382963.954093
## iter 100 value 127001110.380051
## iter 110 value 125358475.051638
## iter 120 value 123280037.651347
## iter 130 value 122322042.406086
## iter 140 value 121103625.573018
## iter 150 value 120269936.052892
## iter 160 value 119399754.438662
## iter 170 value 118821544.045348
## iter 180 value 118428125.019129
## iter 190 value 117793714.695103
## iter 200 value 117421597.430252
## iter 210 value 117141899.944874
## iter 220 value 116910419.527180
## iter 230 value 116546901.790359
## iter 240 value 116146375.033531
## iter 250 value 115837372.818173
## iter 260 value 115456621.413242
## iter 270 value 115239755.805721
## iter 280 value 114983999.130343
## iter 290 value 114787333.262960
## iter 300 value 114626610.339350
## iter 310 value 114339074.643109
## iter 320 value 114121281.525322
## iter 330 value 113942955.121875
## iter 340 value 113810679.790881
## iter 350 value 113636271.304112
## iter 360 value 113471359.104317
## iter 370 value 113324540.892058
## iter 380 value 113122892.739943
## iter 390 value 112932534.924182
## iter 400 value 112627641.244006
## iter 410 value 112222611.951328
## iter 420 value 111601706.485061
## iter 430 value 111059477.486920
## iter 440 value 104896513.362104
## iter 450 value 101444833.924661
## iter 460 value 98430853.467490
## iter 470 value 94783485.081091
## iter 480 value 92638794.426826
## iter 490 value 91380898.080639
## iter 500 value 90060759.025863
## iter 510 value 88188852.794439
## iter 520 value 86850753.270327
## iter 530 value 85669171.070149
## iter 540 value 84137775.599222
## iter 550 value 83065939.632805
## iter 560 value 82283239.421975
## iter 570 value 81264892.792213
## iter 580 value 80454428.471694
## iter 590 value 79283640.916300
## iter 600 value 78574142.832668
## iter 610 value 77720391.728255
## iter 620 value 76075589.932280
## iter 630 value 75710107.270133
## iter 640 value 74240170.429274
## iter 650 value 72741708.095957
## iter 660 value 71800505.303411
## iter 670 value 70805749.124546
## iter 680 value 70269177.195443
## iter 690 value 69862613.501493
## iter 700 value 69453248.539112
## iter 710 value 68937798.194501
## iter 720 value 68489212.621475
## iter 730 value 67956099.667945
## iter 740 value 67452545.438038
## iter 750 value 67044563.212355
## iter 760 value 66646259.402140
## iter 770 value 65990851.575152
## iter 780 value 64817028.772469
## iter 790 value 63713789.367602
## iter 800 value 62513688.676355
## iter 810 value 60760940.100978
## iter 820 value 59913282.830042
## iter 830 value 59272869.686933
## iter 840 value 58632273.005806
## iter 850 value 57510583.444708
## iter 860 value 56145554.925658
## iter 870 value 55434067.024778
## iter 880 value 54838616.285319
## iter 890 value 54323641.422294
## iter 900 value 53624080.549460
## iter 910 value 52956707.346642
## iter 920 value 52635372.364303
## iter 930 value 51768228.681180
## iter 940 value 49991741.172698
## iter 950 value 48431040.497803
## iter 960 value 47251983.361385
## iter 970 value 46406362.416056
## iter 980 value 45255677.910001
## iter 990 value 44396462.640040
## iter1000 value 43453833.500133
## iter1010 value 42246755.438191
## iter1020 value 41300827.155338
## iter1030 value 40181828.059796
## iter1040 value 38802226.846951
## iter1050 value 36659224.436161
## iter1060 value 35835862.424168
## iter1070 value 35126306.773055
## iter1080 value 34744471.938399
## iter1090 value 34409748.697047
## iter1100 value 34190840.598433
## iter1110 value 34046551.933988
## iter1120 value 33947549.198783
## iter1130 value 33813822.213267
## iter1140 value 33527859.164088
## iter1150 value 33442713.460972
## iter1160 value 33383658.102691
## iter1170 value 33106531.463549
## iter1180 value 33013919.252271
## iter1190 value 32917312.572135
## iter1200 value 32780783.952370
## iter1210 value 32672947.055460
## iter1220 value 32287881.669201
## iter1230 value 31833471.819893
## iter1240 value 31563191.286803
## iter1250 value 31424973.858167
## iter1260 value 31325370.189411
## iter1270 value 31226935.838929
## iter1280 value 31100518.507047
## iter1290 value 30968224.901222
## iter1300 value 30835138.017051
## iter1310 value 30782778.541381
## iter1320 value 30662402.074133
## iter1330 value 30599500.615899
## iter1340 value 30431708.636013
## iter1350 value 30085894.123280
## iter1360 value 29847754.132554
## iter1370 value 29550537.749332
## iter1380 value 29308610.874330
## iter1390 value 29102286.353159
## iter1400 value 28860305.526220
## iter1410 value 28510961.921467
## iter1420 value 28057774.401506
## iter1430 value 27740677.778889
## iter1440 value 27654613.382901
## iter1450 value 27591030.150826
## iter1460 value 27536666.609650
## iter1470 value 27466681.927161
## iter1480 value 27309354.697418
## iter1490 value 27149993.676055
## iter1500 value 27040181.734499
## iter1510 value 26977654.117402
## iter1520 value 26925982.303895
## iter1530 value 26863930.046467
## iter1540 value 26760689.780486
## iter1550 value 26626133.950351
## iter1560 value 26420247.688411
## iter1570 value 26075580.375922
## iter1580 value 25822384.225444
## iter1590 value 25601607.781477
## iter1600 value 25368897.272406
## iter1610 value 25170490.625616
## iter1620 value 25028139.287094
## iter1630 value 24825284.890338
## iter1640 value 24577194.480514
## iter1650 value 24346081.469685
## iter1660 value 24108864.919947
## iter1670 value 23888273.461398
## iter1680 value 23764230.558959
## iter1690 value 23636674.595007
## iter1700 value 23563332.657414
## iter1710 value 23511281.168031
## iter1720 value 23467058.758722
## iter1730 value 23433487.607479
## iter1740 value 23398064.530300
## iter1750 value 23366201.700543
## iter1760 value 23348061.092812
## iter1770 value 23331966.574344
## iter1780 value 23316625.199330
## iter1790 value 23283045.445964
## iter1800 value 23212762.839882
## iter1810 value 23199601.740732
## iter1820 value 23149277.537599
## iter1830 value 23031535.817209
## iter1840 value 22986124.312328
## iter1850 value 22965050.407372
## iter1860 value 22964166.868582
## iter1870 value 22963342.532504
## iter1880 value 22962256.595977
## iter1890 value 22959389.144952
## iter1900 value 22953435.618915
## iter1910 value 22945958.213617
## iter1920 value 22938575.605686
## iter1930 value 22935894.883595
## iter1940 value 22933076.146926
## iter1950 value 22931156.101491
## iter1960 value 22930119.979525
## iter1970 value 22928338.517067
## iter1980 value 22926169.610272
## iter1990 value 22920672.288320
## iter2000 value 22917352.162668
## iter2010 value 22915392.571248
## iter2020 value 22912890.518980
## iter2030 value 22910105.943298
## iter2040 value 22906684.916300
## iter2050 value 22902058.785450
## iter2060 value 22894934.741419
## iter2070 value 22894769.390257
## iter2080 value 22894017.619083
## iter2090 value 22893645.995594
## iter2100 value 22893158.792575
## iter2110 value 22892474.029236
## iter2120 value 22891293.912287
## iter2130 value 22890112.150415
## iter2140 value 22888728.224902
## iter2150 value 22885886.694130
## iter2160 value 22880879.411050
## iter2170 value 22874244.520433
## iter2180 value 22871111.912893
## iter2190 value 22868273.755771
## iter2200 value 22864857.553996
## iter2210 value 22854868.178776
## iter2220 value 22841422.976174
## iter2230 value 22828038.167142
## iter2240 value 22813608.164669
## iter2250 value 22794518.282414
## iter2260 value 22781650.038037
## iter2270 value 22773062.480465
## iter2280 value 22765433.936657
## iter2290 value 22754363.262957
## iter2300 value 22740885.580761
## iter2310 value 22661988.841929
## iter2320 value 22439888.524533
## iter2330 value 22392432.663666
## iter2340 value 22374906.752904
## iter2350 value 22363697.201925
## iter2360 value 22347027.524570
## iter2370 value 22335636.808002
## iter2380 value 22329324.723144
## iter2390 value 22324590.120690
## iter2400 value 22322884.856659
## iter2410 value 22322029.399783
## iter2420 value 22321771.755151
## iter2430 value 22321563.553685
## iter2440 value 22321520.333225
## final value 22321518.376554
## converged
neural.registered = predict(neural.model.registered, testset, type="raw")
test$neural.registered = neural.registered
neural.registered.MSE = sum((test$neural.registered-test$registered)^2)/nrow(test)
neural.registered.MSE
## [1] 3532.43
neural.registered.result = ggplot(test,aes(registered,neural.registered))+geom_point()
neural.registered.result
test$neural.registered[test$neural.registered < 0] = 0
neural.registered.MSE = sum((test$neural.registered-test$registered)^2)/nrow(test)
neural.registered.MSE
## [1] 3498.999
neural.registered.result = ggplot(test,aes(registered,neural.registered))+geom_point()
neural.registered.result
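The same two calculations recur for every model in this analysis: the test MSE, and the MSE again after negative predictions are set to zero, since a rental count cannot be negative. A minimal sketch of helpers that would capture this pattern is shown here. The names `mse` and `clip_negative` are hypothetical and do not appear in the original analysis, and the numbers are toy values.

```r
# Hypothetical helpers (not part of the original analysis).

# Mean squared error between observed and predicted values.
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Replace negative predictions with zero, since counts cannot be negative.
clip_negative <- function(predicted) {
  pmax(predicted, 0)
}

# Toy data for illustration only.
actual    <- c(0, 3, 10, 25)
predicted <- c(-2, 4, 8, 30)

mse.raw     <- mse(actual, predicted)
mse.clipped <- mse(actual, clip_negative(predicted))
```

Because every observed count is at least zero, moving a negative prediction up to zero always moves it closer to the target, so clipping can lower the MSE or leave it unchanged but never raise it.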
# Model for Casual
neural.casual.formula = casual~season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
neural.model.casual = nnet(neural.casual.formula, train, size=20, maxit=5000, linout=T, decay=0.01)
## # weights: 281
## initial value 45214417.527726
## iter 10 value 25783493.756270
## iter 20 value 22962102.714545
## iter 30 value 16358974.805088
## iter 40 value 15483419.331333
## iter 50 value 15078790.652825
## iter 60 value 13779035.637621
## iter 70 value 13305914.648520
## iter 80 value 12934003.414897
## iter 90 value 12800985.354611
## iter 100 value 12509742.874950
## iter 110 value 12306701.076024
## iter 120 value 12131452.035006
## iter 130 value 12053021.648556
## iter 140 value 11983109.590623
## iter 150 value 11887118.858882
## iter 160 value 11727088.933557
## iter 170 value 11285499.685422
## iter 180 value 10699923.927553
## iter 190 value 10267727.062433
## iter 200 value 10006026.126844
## iter 210 value 9843368.844550
## iter 220 value 9738816.580881
## iter 230 value 9540334.043399
## iter 240 value 9300095.902442
## iter 250 value 9088914.439828
## iter 260 value 8749717.702616
## iter 270 value 8548689.717004
## iter 280 value 8361239.749458
## iter 290 value 8162263.094508
## iter 300 value 8074811.202164
## iter 310 value 7964353.603733
## iter 320 value 7831518.483539
## iter 330 value 7733002.923687
## iter 340 value 7646911.817943
## iter 350 value 7580150.018505
## iter 360 value 7509823.435954
## iter 370 value 7361689.774464
## iter 380 value 7252306.864039
## iter 390 value 7153840.595112
## iter 400 value 7082238.262624
## iter 410 value 6939310.737458
## iter 420 value 6869245.509823
## iter 430 value 6826925.870942
## iter 440 value 6798973.319214
## iter 450 value 6777256.435256
## iter 460 value 6748886.392163
## iter 470 value 6734499.785346
## iter 480 value 6716341.605907
## iter 490 value 6673480.641991
## iter 500 value 6653973.033608
## iter 510 value 6557841.210427
## iter 520 value 6447766.584268
## iter 530 value 6357816.035876
## iter 540 value 6224701.276552
## iter 550 value 6107851.195948
## iter 560 value 5954178.763245
## iter 570 value 5760484.971669
## iter 580 value 5597390.524754
## iter 590 value 5519686.792756
## iter 600 value 5486457.395309
## iter 610 value 5453282.747432
## iter 620 value 5426984.439661
## iter 630 value 5398876.638860
## iter 640 value 5355504.034493
## iter 650 value 5308783.030668
## iter 660 value 5263009.152591
## iter 670 value 5229696.531953
## iter 680 value 5194169.521916
## iter 690 value 5146692.401054
## iter 700 value 5107869.123862
## iter 710 value 5075166.544935
## iter 720 value 5026504.098210
## iter 730 value 4986398.077278
## iter 740 value 4951505.847573
## iter 750 value 4912434.775103
## iter 760 value 4859121.101555
## iter 770 value 4843000.917812
## iter 780 value 4830431.970243
## iter 790 value 4819516.988498
## iter 800 value 4806590.632776
## iter 810 value 4787551.294771
## iter 820 value 4756784.775086
## iter 830 value 4718180.548716
## iter 840 value 4695069.267871
## iter 850 value 4659321.794708
## iter 860 value 4626455.838325
## iter 870 value 4603051.210041
## iter 880 value 4584934.027218
## iter 890 value 4566991.703409
## iter 900 value 4553145.249418
## iter 910 value 4551541.089303
## iter 920 value 4547737.475402
## iter 930 value 4544464.295164
## iter 940 value 4540956.630788
## iter 950 value 4534661.961958
## iter 960 value 4524831.950040
## iter 970 value 4516153.088589
## iter 980 value 4512660.261463
## iter 990 value 4506975.902519
## iter1000 value 4495678.399521
## iter1010 value 4484593.490755
## iter1020 value 4467433.439451
## iter1030 value 4444205.356998
## iter1040 value 4431006.906468
## iter1050 value 4410384.732804
## iter1060 value 4379567.727729
## iter1070 value 4341808.360800
## iter1080 value 4314928.252094
## iter1090 value 4294790.639326
## iter1100 value 4274064.028811
## iter1110 value 4242902.439389
## iter1120 value 4217833.959182
## iter1130 value 4190896.530046
## iter1140 value 4174981.323867
## iter1150 value 4169472.795145
## iter1160 value 4166575.582223
## iter1170 value 4162707.677102
## iter1180 value 4159334.656227
## iter1190 value 4152492.844283
## iter1200 value 4144540.884064
## iter1210 value 4136460.475406
## iter1220 value 4126135.160457
## iter1230 value 4122407.059942
## iter1240 value 4118787.703031
## iter1250 value 4111762.846614
## iter1260 value 4103697.203238
## iter1270 value 4088206.017027
## iter1280 value 4075833.265829
## iter1290 value 4066430.774257
## iter1300 value 4060506.064070
## iter1310 value 4052736.851262
## iter1320 value 4041811.217628
## iter1330 value 4031425.210711
## iter1340 value 4026749.903934
## iter1350 value 4025498.341836
## iter1360 value 4023869.508571
## iter1370 value 4021404.512217
## iter1380 value 4019915.782875
## iter1390 value 4018442.713556
## iter1400 value 4017234.919918
## iter1410 value 4016262.089610
## iter1420 value 4014921.324723
## iter1430 value 4014101.128800
## iter1440 value 4012600.162725
## iter1450 value 4010529.341469
## iter1460 value 4008086.602237
## iter1470 value 4004651.007804
## iter1480 value 4000344.429574
## iter1490 value 3994683.163734
## iter1500 value 3990199.690352
## iter1510 value 3984001.826169
## iter1520 value 3976405.268943
## iter1530 value 3967416.023342
## iter1540 value 3956987.962405
## iter1550 value 3947505.293533
## iter1560 value 3936936.614851
## iter1570 value 3931312.705042
## iter1580 value 3931097.277075
## iter1590 value 3930904.960513
## iter1600 value 3929773.584138
## iter1610 value 3928536.534401
## iter1620 value 3928218.698989
## iter1630 value 3927814.538134
## iter1640 value 3927402.352442
## iter1650 value 3926800.333832
## iter1660 value 3926180.964609
## iter1670 value 3925729.976555
## iter1680 value 3925143.085217
## iter1690 value 3924221.980571
## iter1700 value 3921462.189236
## iter1710 value 3917818.409777
## iter1720 value 3915092.724139
## iter1730 value 3911410.525292
## iter1740 value 3906676.938087
## iter1750 value 3901631.521972
## iter1760 value 3896365.526685
## iter1770 value 3891677.778219
## iter1780 value 3888163.829274
## iter1790 value 3884195.442252
## iter1800 value 3877544.108921
## iter1810 value 3872231.960313
## iter1820 value 3864672.353021
## iter1830 value 3859392.627049
## iter1840 value 3854192.220220
## iter1850 value 3846746.805129
## iter1860 value 3836808.198884
## iter1870 value 3820532.696806
## iter1880 value 3790393.541514
## iter1890 value 3752556.392863
## iter1900 value 3735919.175273
## iter1910 value 3726578.753731
## iter1920 value 3715237.782154
## iter1930 value 3704886.348018
## iter1940 value 3699151.815490
## iter1950 value 3692996.103018
## iter1960 value 3687724.276450
## iter1970 value 3675444.387122
## iter1980 value 3649861.598918
## iter1990 value 3631135.088745
## iter2000 value 3611972.056292
## iter2010 value 3593527.890049
## iter2020 value 3580408.562802
## iter2030 value 3575465.123942
## iter2040 value 3572211.026915
## iter2050 value 3564653.738449
## iter2060 value 3559072.499292
## iter2070 value 3555467.170889
## iter2080 value 3543808.758279
## iter2090 value 3501668.706058
## iter2100 value 3465438.454287
## iter2110 value 3440142.268250
## iter2120 value 3424005.108467
## iter2130 value 3395708.635432
## iter2140 value 3391161.937708
## iter2150 value 3390046.355434
## iter2160 value 3388617.337674
## iter2170 value 3386548.019477
## iter2180 value 3384379.152262
## iter2190 value 3381495.259089
## iter2200 value 3378097.545772
## iter2210 value 3376838.747234
## iter2220 value 3376024.073411
## iter2230 value 3374962.204020
## iter2240 value 3373413.345884
## iter2250 value 3371340.000523
## iter2260 value 3369582.691673
## iter2270 value 3368525.743365
## iter2280 value 3366721.780358
## iter2290 value 3362663.239400
## iter2300 value 3355778.371249
## iter2310 value 3346688.722351
## iter2320 value 3335279.454434
## iter2330 value 3326113.393524
## iter2340 value 3313314.449856
## iter2350 value 3303009.193281
## iter2360 value 3288566.708618
## iter2370 value 3271329.516374
## iter2380 value 3257206.944694
## iter2390 value 3255627.379885
## iter2400 value 3254923.894913
## iter2410 value 3253676.368964
## iter2420 value 3252495.312728
## iter2430 value 3251105.711448
## iter2440 value 3249411.129474
## iter2450 value 3248240.346191
## iter2460 value 3247380.881804
## iter2470 value 3246288.725664
## iter2480 value 3245476.066953
## iter2490 value 3244954.501926
## iter2500 value 3243967.544717
## iter2510 value 3243315.543059
## iter2520 value 3242517.337090
## iter2530 value 3240843.684409
## iter2540 value 3239110.654068
## iter2550 value 3235861.000880
## iter2560 value 3231090.358974
## iter2570 value 3226330.246352
## iter2580 value 3220304.966729
## iter2590 value 3209694.249868
## iter2600 value 3199657.704166
## iter2610 value 3183759.646358
## iter2620 value 3169101.262506
## iter2630 value 3160570.555632
## iter2640 value 3152608.308889
## iter2650 value 3144894.380890
## iter2660 value 3136228.238317
## iter2670 value 3131131.825520
## iter2680 value 3126564.176290
## iter2690 value 3117391.884958
## iter2700 value 3109175.896790
## iter2710 value 3104306.272823
## iter2720 value 3096846.620565
## iter2730 value 3086922.804714
## iter2740 value 3079400.409189
## iter2750 value 3070620.436893
## iter2760 value 3069409.069562
## iter2770 value 3067639.976095
## iter2780 value 3066989.574984
## iter2790 value 3066492.365718
## iter2800 value 3065834.139545
## iter2810 value 3065010.212070
## iter2820 value 3064583.958528
## iter2830 value 3064111.180737
## iter2840 value 3063459.960425
## iter2850 value 3062831.944474
## iter2860 value 3062266.900539
## iter2870 value 3061959.231378
## iter2880 value 3061769.590421
## iter2890 value 3061522.140490
## iter2900 value 3060968.867623
## iter2910 value 3060266.395464
## iter2920 value 3059146.094958
## iter2930 value 3058394.631878
## iter2940 value 3057064.514652
## iter2950 value 3056082.776272
## iter2960 value 3054805.188309
## iter2970 value 3053942.398245
## iter2980 value 3053325.214000
## iter2990 value 3052282.682880
## iter3000 value 3051231.078722
## iter3010 value 3049978.966722
## iter3020 value 3048185.089731
## iter3030 value 3046359.097667
## iter3040 value 3044846.632130
## iter3050 value 3040735.497704
## iter3060 value 3036674.778528
## iter3070 value 3035085.277855
## iter3080 value 3034462.855518
## iter3090 value 3034265.348796
## iter3100 value 3034247.820748
## iter3110 value 3034219.139840
## iter3120 value 3034179.503822
## iter3130 value 3034108.742281
## iter3140 value 3034006.475035
## iter3150 value 3033784.475263
## iter3160 value 3033470.586892
## iter3170 value 3033258.235492
## iter3180 value 3033167.790024
## iter3190 value 3033060.565063
## iter3200 value 3032952.185571
## iter3210 value 3032680.603648
## iter3220 value 3032320.979293
## iter3230 value 3032007.831483
## iter3240 value 3031557.109973
## iter3250 value 3030606.652430
## iter3260 value 3028644.102134
## iter3270 value 3027256.669538
## iter3280 value 3026026.905622
## iter3290 value 3024176.581816
## iter3300 value 3023129.200853
## iter3310 value 3022616.041283
## iter3320 value 3022398.285129
## iter3330 value 3022219.452207
## iter3340 value 3022032.601744
## iter3350 value 3021928.250158
## iter3360 value 3021832.199856
## iter3370 value 3021740.194871
## iter3380 value 3021679.640545
## iter3390 value 3021663.906033
## iter3400 value 3021662.449575
## iter3410 value 3021660.640546
## iter3420 value 3021655.153558
## iter3430 value 3021646.536963
## final value 3021645.984970
## converged
neural.casual = predict(neural.model.casual, testset, type="raw")
test$neural.casual = neural.casual
neural.casual.MSE = sum((test$neural.casual-test$casual)^2)/nrow(test)
neural.casual.MSE
## [1] 2323.752
neural.casual.result = ggplot(test,aes(casual,neural.casual))+geom_point()
neural.casual.result
test$neural.casual[test$neural.casual < 0] = 0
neural.casual.MSE = sum((test$neural.casual-test$casual)^2)/nrow(test)
neural.casual.MSE
## [1] 2302.934
neural.casual.result = ggplot(test,aes(casual,neural.casual))+geom_point()
neural.casual.result
Step3: Now combine the predicted registered users and casual users
test$neural.combined = test$neural.casual + test$neural.registered
neural.combined.MSE = sum((test$cnt-test$neural.combined)^2)/nrow(test)
neural.combined.MSE
## [1] 7194.954
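The combined MSE is not the sum of the two component MSEs (3498.999 + 2302.934 = 5801.933, against 7194.954 here). Assuming cnt = registered + casual on every row, the combined error is the sum of the two component errors, and squaring it adds a cross term: MSE(combined) = MSE(registered) + MSE(casual) + 2 * mean(e.registered * e.casual). A combined MSE above the plain sum therefore suggests that the two models tend to err in the same direction on the same rows. The sketch below checks the identity on synthetic data; all variable names and distributions in it are illustrative assumptions.

```r
set.seed(1)
n <- 1000

# Synthetic counts (illustration only); cnt is the sum of the two parts.
registered <- rpois(n, 150)
casual     <- rpois(n, 35)
cnt        <- registered + casual

# Synthetic predictions whose errors share a common component,
# so the two error series are positively correlated.
shared.error    <- rnorm(n, sd = 20)
pred.registered <- registered + shared.error + rnorm(n, sd = 30)
pred.casual     <- casual + 0.5 * shared.error + rnorm(n, sd = 10)

e.registered <- pred.registered - registered
e.casual     <- pred.casual - casual

mse.registered <- mean(e.registered^2)
mse.casual     <- mean(e.casual^2)
cross.term     <- 2 * mean(e.registered * e.casual)

# MSE of the summed prediction against cnt.
mse.combined <- mean((pred.registered + pred.casual - cnt)^2)

# The two quantities below agree up to floating-point error.
c(combined = mse.combined,
  decomposed = mse.registered + mse.casual + cross.term)
```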
Convert the categorical variables to factors for the remaining models
# Factorization of training data
train$season <- factor(train$season)
train$yr <- factor(train$yr)
train$mnth <- factor(train$mnth)
train$hr <- factor(train$hr)
train$holiday <- factor(train$holiday)
train$weekday<- factor(train$weekday)
train$workingday <- factor(train$workingday)
train$weathersit <- factor(train$weathersit)
# Factorization of test data
test$season <- factor(test$season)
test$yr <- factor(test$yr)
test$mnth <- factor(test$mnth)
test$hr <- factor(test$hr)
test$holiday <- factor(test$holiday)
test$weekday<- factor(test$weekday)
test$workingday <- factor(test$workingday)
test$weathersit <- factor(test$weathersit)
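The sixteen assignments above could also be written as one loop over the column names. The sketch below is an alternative, not the original code. It additionally takes the factor levels from the union of both data sets, which is an added assumption aimed at the case where a rare category (for example weathersit 4) appears in only one of the two splits. The toy data frames `train.toy` and `test.toy` are hypothetical stand-ins for train and test.

```r
categorical <- c("season", "yr", "mnth", "hr", "holiday",
                 "weekday", "workingday", "weathersit")

# Toy stand-ins for train and test (illustration only).
train.toy <- data.frame(season = c(1, 2, 3), weathersit = c(1, 2, 4))
test.toy  <- data.frame(season = c(2, 4, 4), weathersit = c(1, 3, 3))

# Only loop over the columns that exist in the toy data.
cols <- intersect(categorical, names(train.toy))

for (col in cols) {
  # Shared level set so both data frames encode categories identically.
  lv <- sort(unique(c(train.toy[[col]], test.toy[[col]])))
  train.toy[[col]] <- factor(train.toy[[col]], levels = lv)
  test.toy[[col]]  <- factor(test.toy[[col]],  levels = lv)
}

levels(train.toy$weathersit)
levels(test.toy$weathersit)
```

Matching level sets avoid surprises when a predict method compares the factor levels of new data against those seen in training.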
Step1: Original model (linear regression)
# Original Model
lm.formula = cnt ~ season + yr + mnth + hr + holiday + weekday +
workingday + weathersit + temp + atemp + hum + windspeed
fit.lm = lm(lm.formula,data=train)
summary(fit.lm)
##
## Call:
## lm(formula = lm.formula, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -357.98 -60.35 -7.81 50.80 436.42
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.367 7.970 -10.962 < 2e-16 ***
## season2 30.622 15.167 2.019 0.043509 *
## season3 3.532 21.332 0.166 0.868512
## season4 37.104 14.999 2.474 0.013383 *
## yr1 87.418 1.859 47.022 < 2e-16 ***
## mnth2 11.189 4.568 2.449 0.014336 *
## mnth3 29.450 4.939 5.963 2.55e-09 ***
## mnth4 24.124 15.883 1.519 0.128824
## mnth5 47.844 15.993 2.991 0.002782 **
## mnth6 35.138 16.281 2.158 0.030926 *
## mnth7 35.805 22.074 1.622 0.104820
## mnth8 49.474 22.012 2.248 0.024623 *
## mnth9 77.941 21.826 3.571 0.000357 ***
## mnth10 60.024 16.067 3.736 0.000188 ***
## mnth11 42.706 15.780 2.706 0.006812 **
## mnth12 38.384 15.071 2.547 0.010883 *
## hr1 -17.267 6.337 -2.725 0.006441 **
## hr2 -27.562 6.357 -4.336 1.47e-05 ***
## hr3 -38.581 6.416 -6.013 1.88e-09 ***
## hr4 -39.780 6.396 -6.219 5.16e-10 ***
## hr5 -24.003 6.364 -3.772 0.000163 ***
## hr6 35.766 6.353 5.630 1.84e-08 ***
## hr7 170.904 6.344 26.938 < 2e-16 ***
## hr8 314.405 6.335 49.630 < 2e-16 ***
## hr9 166.689 6.341 26.287 < 2e-16 ***
## hr10 110.435 6.366 17.348 < 2e-16 ***
## hr11 137.322 6.412 21.417 < 2e-16 ***
## hr12 177.658 6.463 27.490 < 2e-16 ***
## hr13 173.189 6.516 26.580 < 2e-16 ***
## hr14 156.575 6.558 23.876 < 2e-16 ***
## hr15 167.026 6.569 25.426 < 2e-16 ***
## hr16 230.234 6.554 35.127 < 2e-16 ***
## hr17 387.221 6.517 59.413 < 2e-16 ***
## hr18 352.712 6.472 54.495 < 2e-16 ***
## hr19 240.526 6.408 37.538 < 2e-16 ***
## hr20 158.888 6.373 24.931 < 2e-16 ***
## hr21 108.764 6.346 17.138 < 2e-16 ***
## hr22 72.833 6.335 11.497 < 2e-16 ***
## hr23 33.563 6.330 5.302 1.17e-07 ***
## holiday1 -5.153 5.805 -0.888 0.374687
## weekday1 4.933 3.568 1.382 0.166908
## weekday2 10.852 3.447 3.149 0.001644 **
## weekday3 10.816 3.448 3.137 0.001711 **
## weekday4 11.964 3.441 3.477 0.000510 ***
## weekday5 18.493 3.448 5.364 8.30e-08 ***
## weekday6 20.338 3.420 5.946 2.82e-09 ***
## workingday1 NA NA NA NA
## weathersit2 -11.980 2.258 -5.306 1.14e-07 ***
## weathersit3 -65.555 3.806 -17.223 < 2e-16 ***
## weathersit4 -51.508 71.197 -0.723 0.469414
## temp 108.484 33.430 3.245 0.001177 **
## atemp 110.132 34.130 3.227 0.001255 **
## hum -80.389 6.680 -12.035 < 2e-16 ***
## windspeed -33.401 8.380 -3.986 6.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100.4 on 11981 degrees of freedom
## Multiple R-squared: 0.6942, Adjusted R-squared: 0.6929
## F-statistic: 523.1 on 52 and 11981 DF, p-value: < 2.2e-16
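The note "(1 not defined because of singularities)" and the NA row for workingday1 arise because workingday carries no information beyond weekday and holiday: by the data description it is 1 exactly when the day is neither a weekend nor a holiday. When no holiday falls on a weekend, workingday is an exact linear combination of the weekday and holiday dummy columns, so lm drops it. The sketch below reproduces the effect on synthetic data. The coding of weekday 0 and 6 as the weekend, and holidays falling only on weekdays, are assumptions built into the toy data and are not checked against hour.csv.

```r
set.seed(42)

# Synthetic calendar: 20 weeks, weekday coded 0..6 (0 and 6 assumed weekend).
d <- expand.grid(weekday = 0:6, week = 1:20)

# Toy holidays: the Monday of every fifth week.
d$holiday <- as.integer(d$weekday == 1 & d$week %% 5 == 0)

# workingday is 1 when the day is neither weekend nor holiday.
d$workingday <- as.integer(!(d$weekday %in% c(0, 6)) & d$holiday == 0)

# Arbitrary response; only the design matrix matters here.
d$y <- rnorm(nrow(d))

d$weekday    <- factor(d$weekday)
d$holiday    <- factor(d$holiday)
d$workingday <- factor(d$workingday)

fit <- lm(y ~ weekday + holiday + workingday, data = d)

# The workingday coefficient cannot be estimated and is reported as NA.
coef(fit)["workingday1"]

# alias() expresses the dropped column in terms of the retained ones.
alias(fit)$Complete
```

Leaving workingday out of the formula would give the same fitted values in such a case, since the remaining columns span the same space.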
lm.cnt=predict(fit.lm, test)
test$lm.cnt = lm.cnt
lm.MSE = sum((test$cnt - test$lm.cnt)^2)/nrow(test)
lm.MSE
## [1] 11297.42
lm.result = ggplot(test,aes(cnt,lm.cnt))+geom_point()
lm.result
test$lm.cnt[test$lm.cnt < 0] = 0
lm.MSE = sum((test$cnt - test$lm.cnt)^2)/nrow(test)
lm.MSE
## [1] 10738.16
lm.result = ggplot(test,aes(cnt,lm.cnt))+geom_point()
lm.result
Step2: Separate models for registered and casual
# Model for Registered
# weekday and windspeed are taken out
lm.registered.formula = registered ~ season + yr + mnth + hr + holiday +
workingday + weathersit + temp + atemp + hum
lm.model.registered = lm(lm.registered.formula,data=train)
lm.registered = predict(lm.model.registered,test)
test$lm.registered = lm.registered
lm.registered.MSE = sum((test$lm.registered-test$registered)^2)/nrow(test)
lm.registered.MSE
## [1] 8157.388
lm.registered.result = ggplot(test,aes(registered,lm.registered))+geom_point()
lm.registered.result
test$lm.registered[test$lm.registered < 0] = 0
lm.registered.MSE = sum((test$lm.registered-test$registered)^2)/nrow(test)
lm.registered.MSE
## [1] 7763.378
lm.registered.result = ggplot(test,aes(registered,lm.registered))+geom_point()
lm.registered.result
# Model for Casual
# Windspeed is taken out
lm.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum
lm.model.casual = lm(lm.casual.formula, data = train)
lm.casual = predict(lm.model.casual,test)
test$lm.casual = lm.casual
lm.casual.MSE = sum((test$lm.casual-test$casual)^2)/nrow(test)
lm.casual.MSE
## [1] 998.3204
lm.casual.result = ggplot(test,aes(casual,lm.casual))+geom_point()
lm.casual.result
test$lm.casual[test$lm.casual < 0] = 0
lm.casual.MSE = sum((test$lm.casual-test$casual)^2)/nrow(test)
lm.casual.MSE
## [1] 913.7805
lm.casual.result = ggplot(test,aes(casual,lm.casual))+geom_point()
lm.casual.result
Step3: Now combine the predicted registered users and casual users
test$lm.combined = test$lm.casual + test$lm.registered
lm.combined.MSE = sum((test$cnt-test$lm.combined)^2)/nrow(test)
lm.combined.MSE
## [1] 10684.49
Step1: Original model (support vector machine)
# Original Model
svm.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
svm.model = svm(svm.formula, data = train)
svm.cnt=predict(svm.model, test)
test$svm.cnt = svm.cnt
svm.MSE = sum((test$cnt - test$svm.cnt)^2)/nrow(test)
svm.MSE
## [1] 7546.31
svm.result = ggplot(test,aes(cnt,svm.cnt))+geom_point()
svm.result
test$svm.cnt[test$svm.cnt < 0] = 0
svm.MSE = sum((test$cnt - test$svm.cnt)^2)/nrow(test)
svm.MSE
## [1] 7511.831
svm.result = ggplot(test,aes(cnt,svm.cnt))+geom_point()
svm.result
Step2: Separate models for registered and casual
# Model for Registered
svm.registered.formula = registered ~ season + yr + mnth + hr + weekday + workingday + temp + atemp + hum
svm.model.registered = svm(svm.registered.formula, data = train)
svm.registered = predict(svm.model.registered,test)
test$svm.registered = svm.registered
svm.registered.MSE = sum((test$svm.registered-test$registered)^2)/nrow(test)
svm.registered.MSE
## [1] 5870.63
svm.registered.result = ggplot(test,aes(registered,svm.registered))+geom_point()
svm.registered.result
test$svm.registered[test$svm.registered < 0] = 0
svm.registered.MSE = sum((test$svm.registered-test$registered)^2)/nrow(test)
svm.registered.MSE
## [1] 5854.001
svm.registered.result = ggplot(test,aes(registered,svm.registered))+geom_point()
svm.registered.result
# Model for Casual
svm.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
svm.model.casual = svm(svm.casual.formula, data = train)
svm.casual = predict(svm.model.casual,test)
test$svm.casual = svm.casual
svm.casual.MSE = sum((test$svm.casual-test$casual)^2)/nrow(test)
svm.casual.MSE
## [1] 650.8312
svm.casual.result = ggplot(test,aes(casual,svm.casual))+geom_point()
svm.casual.result
test$svm.casual[test$svm.casual < 0] = 0
svm.casual.MSE = sum((test$svm.casual-test$casual)^2)/nrow(test)
svm.casual.MSE
## [1] 647.9083
svm.casual.result = ggplot(test,aes(casual,svm.casual))+geom_point()
svm.casual.result
Step3: Now combine the predicted registered users and casual users
test$svm.combined = test$svm.casual + test$svm.registered
svm.combined.MSE = sum((test$cnt-test$svm.combined)^2)/nrow(test)
svm.combined.MSE
## [1] 7655.972
Step1: Original model (random forest)
# Original Model
forest.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
forest.model = randomForest(forest.formula, data = train, importance = TRUE, ntree = 200)
forest.model$importance
## %IncMSE IncNodePurity
## season 2011.56930 9278210.6
## yr 6192.83240 30285109.1
## mnth 3056.79334 17557971.9
## hr 38472.46107 209246414.1
## holiday 92.64652 880980.2
## weekday 3876.14948 17131864.1
## workingday 4389.96272 13304939.1
## weathersit 872.92779 6161000.9
## temp 4525.47739 25254204.6
## atemp 4557.41799 30696815.7
## hum 3484.10188 22285222.4
## windspeed 355.81994 6326822.3
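The importance matrix is easier to read when sorted. The sketch below rebuilds the %IncMSE column from the printed output above and orders it; the vector name `inc.mse` is hypothetical. With the fitted model in memory, the same ordering could be applied to the corresponding column of forest.model$importance, and the randomForest package also provides importance() and varImpPlot() for inspecting these values.

```r
# %IncMSE values copied from the printed importance output.
inc.mse <- c(season = 2011.56930, yr = 6192.83240, mnth = 3056.79334,
             hr = 38472.46107, holiday = 92.64652, weekday = 3876.14948,
             workingday = 4389.96272, weathersit = 872.92779,
             temp = 4525.47739, atemp = 4557.41799, hum = 3484.10188,
             windspeed = 355.81994)

# Largest increase in MSE first.
inc.mse.sorted <- sort(inc.mse, decreasing = TRUE)
round(inc.mse.sorted)
```

By this measure hr dominates, followed by yr, while holiday and windspeed contribute least.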
forest.cnt = predict(forest.model,test)
test$forest.cnt = forest.cnt
forest.MSE = sum((test$cnt - test$forest.cnt)^2)/nrow(test)
forest.MSE
## [1] 3798.602
forest.result = ggplot(test,aes(cnt,forest.cnt))+geom_point()
forest.result
test$forest.cnt[test$forest.cnt < 0] = 0
forest.MSE = sum((test$cnt - test$forest.cnt)^2)/nrow(test)
forest.MSE
## [1] 3798.602
forest.result = ggplot(test,aes(cnt,forest.cnt))+geom_point()
forest.result
Step2: Separate models for registered and casual
# Model for Registered
forest.registered.formula = registered ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
forest.model.registered = randomForest(forest.registered.formula, data = train, importance = TRUE, ntree = 200)
forest.model.registered$importance
## %IncMSE IncNodePurity
## season 1408.11543 6510456.6
## yr 4731.70414 22483109.8
## mnth 2138.89288 11764156.9
## hr 29744.78947 153690889.0
## holiday 68.59201 646385.2
## weekday 3138.03807 14667232.8
## workingday 4044.50293 13589716.1
## weathersit 646.85498 4381252.9
## temp 2221.16143 10979597.2
## atemp 2409.19034 14759622.3
## hum 1839.72849 12772305.1
## windspeed 232.38954 4118188.1
forest.registered = predict(forest.model.registered,test)
test$forest.registered = forest.registered
forest.registered.MSE = sum((test$forest.registered-test$registered)^2)/nrow(test)
forest.registered.MSE
## [1] 2619.906
forest.registered.result = ggplot(test,aes(registered,forest.registered))+geom_point()
forest.registered.result
test$forest.registered[test$forest.registered < 0] = 0
forest.registered.MSE = sum((test$forest.registered-test$registered)^2)/nrow(test)
forest.registered.MSE
## [1] 2619.906
forest.registered.result = ggplot(test,aes(registered,forest.registered))+geom_point()
forest.registered.result
# Model for Casual
forest.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
forest.model.casual = randomForest(forest.casual.formula, data = train, importance = TRUE, ntree = 160)
forest.model.casual$importance
## %IncMSE IncNodePurity
## season 146.91900 520619.5
## yr 237.04584 994050.4
## mnth 354.25313 1958205.0
## hr 1880.66235 9826946.5
## holiday 29.69380 171391.2
## weekday 753.85289 2932163.9
## workingday 841.01505 3023545.2
## weathersit 48.05173 316445.6
## temp 579.24601 3081440.2
## atemp 668.77214 4066669.9
## hum 311.39555 2149306.1
## windspeed 33.85808 528132.4
forest.casual = predict(forest.model.casual,test)
test$forest.casual = forest.casual
forest.casual.MSE = sum((test$forest.casual-test$casual)^2)/nrow(test)
forest.casual.MSE
## [1] 379.7289
forest.casual.result = ggplot(test,aes(casual,forest.casual))+geom_point()
forest.casual.result
test$forest.casual[test$forest.casual < 0] = 0
forest.casual.MSE = sum((test$forest.casual-test$casual)^2)/nrow(test)
forest.casual.MSE
## [1] 379.7289
forest.casual.result = ggplot(test,aes(casual,forest.casual))+geom_point()
forest.casual.result
Step3: Now combine the predicted registered users and casual users
test$forest.combined = test$forest.casual + test$forest.registered
forest.combined.MSE = sum((test$cnt-test$forest.combined)^2)/nrow(test)
forest.combined.MSE
## [1] 3515.571
Step1: Original model (gradient boosting)
# Original Model
gbm.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
gbm.model = gbm(gbm.formula, data=train, n.trees=1000, distribution="gaussian", interaction.depth=5, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
summary(gbm.model)
## var rel.inf
## hr hr 59.5640798
## yr yr 10.1021434
## workingday workingday 7.3482236
## mnth mnth 5.7626710
## temp temp 5.0394995
## weekday weekday 3.5016279
## atemp atemp 2.5713378
## season season 2.0316668
## weathersit weathersit 1.9001772
## hum hum 1.6908947
## windspeed windspeed 0.3362276
## holiday holiday 0.1514507
pef.trees = gbm.perf(gbm.model)
## Using OOB method...
gbm.cnt = predict(gbm.model, newdata=test, n.trees=pef.trees)
test$gbm.cnt = gbm.cnt
gbm.MSE = sum((test$cnt - test$gbm.cnt)^2)/nrow(test)
gbm.MSE
## [1] 3572.765
gbm.result = ggplot(test,aes(cnt,gbm.cnt))+geom_point()
gbm.result
test$gbm.cnt[test$gbm.cnt < 0] = 0
gbm.MSE = sum((test$cnt - test$gbm.cnt)^2)/nrow(test)
gbm.MSE
## [1] 3556.995
gbm.result = ggplot(test,aes(cnt,gbm.cnt))+geom_point()
gbm.result
Step2: Separate models for registered and casual
# Model for Registered
# Dropping holiday, windspeed, hum and atemp gives the best result
gbm.registered.formula = registered ~ season + yr + mnth + hr + weekday + workingday+ weathersit + temp
gbm.model.registered = gbm(gbm.registered.formula, data=train, n.trees=1000, distribution="gaussian", interaction.depth=5, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
summary(gbm.model.registered)
## var rel.inf
## hr hr 59.422886
## workingday workingday 11.098504
## yr yr 10.946578
## mnth mnth 5.802469
## weekday weekday 4.369740
## temp temp 3.823567
## weathersit weathersit 2.524691
## season season 2.011565
perf.trees.registered = gbm.perf(gbm.model.registered)
## Using OOB method...
gbm.registered = predict(gbm.model.registered,newdata=test,n.trees = perf.trees.registered)
test$gbm.registered = gbm.registered
gbm.registered.MSE = sum((test$gbm.registered-test$registered)^2)/nrow(test)
gbm.registered.MSE
## [1] 2642.751
gbm.registered.result = ggplot(test,aes(registered,gbm.registered))+geom_point()
gbm.registered.result
test$gbm.registered[test$gbm.registered < 0] = 0
gbm.registered.MSE = sum((test$gbm.registered-test$registered)^2)/nrow(test)
gbm.registered.MSE
## [1] 2636.727
gbm.registered.result = ggplot(test,aes(registered,gbm.registered))+geom_point()
gbm.registered.result
# Model for Casual
gbm.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
gbm.model.casual = gbm(gbm.casual.formula, data=train, n.trees=1000, distribution="gaussian", interaction.depth=5, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
summary(gbm.model.casual)
## var rel.inf
## hr hr 34.9461746
## weekday weekday 12.5017985
## atemp atemp 11.8526189
## temp temp 11.7052478
## workingday workingday 11.1590400
## mnth mnth 7.2706955
## hum hum 4.1731473
## yr yr 4.0130881
## windspeed windspeed 0.8472661
## weathersit weathersit 0.8341478
## holiday holiday 0.4924107
## season season 0.2043647
perf.trees.casual = gbm.perf(gbm.model.casual)
## Using OOB method...
gbm.casual = predict(gbm.model.casual,newdata=test,n.trees = perf.trees.casual)
test$gbm.casual = gbm.casual
gbm.casual.MSE = sum((test$gbm.casual-test$casual)^2)/nrow(test)
gbm.casual.MSE
## [1] 344.1961
gbm.casual.result = ggplot(test,aes(casual,gbm.casual))+geom_point()
gbm.casual.result
test$gbm.casual[test$gbm.casual < 0] = 0
gbm.casual.MSE = sum((test$gbm.casual-test$casual)^2)/nrow(test)
gbm.casual.MSE
## [1] 343.1621
gbm.casual.result = ggplot(test,aes(casual,gbm.casual))+geom_point()
gbm.casual.result
Step3: Now combine the predicted registered users and casual users
test$gbm.combined = test$gbm.casual + test$gbm.registered
gbm.combined.MSE = sum((test$cnt-test$gbm.combined)^2)/nrow(test)
gbm.combined.MSE
## [1] 3369.337
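The test MSEs reported above (after clipping) can be gathered into one table to compare predicting cnt directly against summing the registered and casual predictions. The values below are copied from the output in this section; the data frame `mse.table` and its column names are only an illustrative arrangement. The neural network is left out because its direct-model MSE is not shown in this part.

```r
mse.table <- data.frame(
  model    = c("lm", "svm", "randomForest", "gbm"),
  direct   = c(10738.16, 7511.831, 3798.602, 3556.995),
  combined = c(10684.49, 7655.972, 3515.571, 3369.337)
)

# Positive values mean the combined registered + casual prediction did better.
mse.table$improvement <- mse.table$direct - mse.table$combined

# RMSE is expressed in the same unit as the rental counts.
mse.table$rmse.combined <- sqrt(mse.table$combined)

# Best combined model first.
mse.table[order(mse.table$combined), ]
```

In these runs the split lowered the MSE for lm, randomForest and gbm, and raised it slightly for svm. Note that some of the separate models use reduced formulas, so the comparison is not purely about splitting the target.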
Step1: Original model (regression tree)
# Original Model
formula.cnt <- cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
fit.rpart.cnt <- rpart(formula.cnt, method="anova", data= train)
print(summary(fit.rpart.cnt))
## Call:
## rpart(formula = formula.cnt, data = train, method = "anova")
## n= 12034
##
## CP nsplit rel error xerror xstd
## 1 0.36074746 0 1.0000000 1.0001864 0.016576271
## 2 0.10238796 1 0.6392525 0.6413939 0.012165579
## 3 0.05755831 2 0.5368646 0.5461442 0.009611061
## 4 0.03971267 3 0.4793063 0.4701578 0.008041499
## 5 0.03050477 4 0.4395936 0.4250503 0.007356162
## 6 0.02638756 5 0.4090888 0.3971615 0.006791662
## 7 0.02181266 6 0.3827013 0.3758256 0.006298499
## 8 0.02006172 7 0.3608886 0.3553995 0.006155750
## 9 0.01945226 8 0.3408269 0.3468612 0.006001831
## 10 0.01611249 10 0.3019223 0.3217805 0.005539678
## 11 0.01412510 11 0.2858099 0.2965783 0.005277021
## 12 0.01332590 12 0.2716848 0.2826774 0.005268221
## 13 0.01123726 13 0.2583589 0.2667504 0.005026355
## 14 0.01000000 14 0.2471216 0.2596578 0.004909933
##
## Variable importance
## hr yr mnth temp atemp hum
## 51 9 7 7 7 5
## season workingday weekday windspeed
## 5 4 4 1
##
## Node number 1: 12034 observations, complexity param=0.3607475
## mean=191.5313, MSE=32814.64
## left son=2 (4479 obs) right son=3 (7555 obs)
## Primary splits:
## hr splits as LLLLLLLRRRRRRRRRRRRRRRLL, improve=0.36074750, (0 missing)
## atemp < 0.5985 to the left, improve=0.12935650, (0 missing)
## temp < 0.55 to the left, improve=0.10419890, (0 missing)
## hum < 0.665 to the right, improve=0.07209467, (0 missing)
## yr splits as LR, improve=0.06667235, (0 missing)
## Surrogate splits:
## hum < 0.765 to the right, agree=0.663, adj=0.094, (0 split)
## windspeed < 0.09705 to the left, agree=0.632, adj=0.013, (0 split)
## temp < 0.15 to the left, agree=0.630, adj=0.007, (0 split)
## atemp < 0.08335 to the left, agree=0.629, adj=0.003, (0 split)
##
## Node number 2: 4479 observations, complexity param=0.0141251
## mean=50.22483, MSE=3239.629
## left son=4 (2968 obs) right son=5 (1511 obs)
## Primary splits:
## hr splits as LLLLLLR---------------RR, improve=0.38440840, (0 missing)
## temp < 0.47 to the left, improve=0.06934414, (0 missing)
## atemp < 0.4621 to the left, improve=0.06834863, (0 missing)
## mnth splits as LLLLRRRRRRLL, improve=0.05825211, (0 missing)
## season splits as LRRR, improve=0.05093041, (0 missing)
## Surrogate splits:
## temp < 0.77 to the left, agree=0.666, adj=0.009, (0 split)
## atemp < 0.73485 to the left, agree=0.665, adj=0.008, (0 split)
## hum < 0.11 to the right, agree=0.663, adj=0.001, (0 split)
##
## Node number 3: 7555 observations, complexity param=0.102388
## mean=275.3052, MSE=31492.39
## left son=6 (5036 obs) right son=7 (2519 obs)
## Primary splits:
## hr splits as -------LRLLLLLLLRRRRLL--, improve=0.1699364, (0 missing)
## yr splits as LR, improve=0.1505464, (0 missing)
## temp < 0.49 to the left, improve=0.1391370, (0 missing)
## atemp < 0.47725 to the left, improve=0.1364355, (0 missing)
## season splits as LRRR, improve=0.1233174, (0 missing)
## Surrogate splits:
## windspeed < 0.7985 to the left, agree=0.667, adj=0.001, (0 split)
## weathersit splits as LLLR, agree=0.667, adj=0.000, (0 split)
## atemp < 0.9015 to the left, agree=0.667, adj=0.000, (0 split)
##
## Node number 4: 2968 observations
## mean=25.04549, MSE=933.8123
##
## Node number 5: 1511 observations
## mean=99.68365, MSE=4077.341
##
## Node number 6: 5036 observations, complexity param=0.03971267
## mean=223.5663, MSE=17665.7
## left son=12 (2516 obs) right son=13 (2520 obs)
## Primary splits:
## yr splits as LR, improve=0.1762747, (0 missing)
## temp < 0.45 to the left, improve=0.1504354, (0 missing)
## atemp < 0.44695 to the left, improve=0.1476129, (0 missing)
## season splits as LRRR, improve=0.1474771, (0 missing)
## mnth splits as LLLRRRRRRRRR, improve=0.1461857, (0 missing)
## Surrogate splits:
## hum < 0.505 to the right, agree=0.535, adj=0.070, (0 split)
## temp < 0.55 to the left, agree=0.529, adj=0.057, (0 split)
## atemp < 0.52275 to the left, agree=0.528, adj=0.056, (0 split)
## windspeed < 0.26865 to the right, agree=0.527, adj=0.053, (0 split)
## weathersit splits as RLL-, agree=0.512, adj=0.024, (0 split)
##
## Node number 7: 2519 observations, complexity param=0.05755831
## mean=378.742, MSE=43083.91
## left son=14 (1259 obs) right son=15 (1260 obs)
## Primary splits:
## yr splits as LR, improve=0.2094317, (0 missing)
## temp < 0.49 to the left, improve=0.1993776, (0 missing)
## atemp < 0.47725 to the left, improve=0.1949973, (0 missing)
## season splits as LRRR, improve=0.1673958, (0 missing)
## mnth splits as LLLRRRRRRRRR, improve=0.1626678, (0 missing)
## Surrogate splits:
## hum < 0.445 to the right, agree=0.554, adj=0.107, (0 split)
## atemp < 0.58335 to the left, agree=0.533, adj=0.066, (0 split)
## temp < 0.59 to the left, agree=0.532, adj=0.064, (0 split)
## windspeed < 0.31345 to the right, agree=0.515, adj=0.030, (0 split)
## weathersit splits as RRLR, agree=0.509, adj=0.018, (0 split)
##
## Node number 12: 2516 observations, complexity param=0.01611249
## mean=167.7186, MSE=9589.757
## left son=24 (836 obs) right son=25 (1680 obs)
## Primary splits:
## mnth splits as LLLLRRRRRRRR, improve=0.2637072, (0 missing)
## season splits as LRRR, improve=0.2394119, (0 missing)
## atemp < 0.4621 to the left, improve=0.2192312, (0 missing)
## temp < 0.47 to the left, improve=0.2192312, (0 missing)
## workingday splits as RL, improve=0.0625818, (0 missing)
## Surrogate splits:
## season splits as LRRR, agree=0.909, adj=0.725, (0 split)
## atemp < 0.35605 to the left, agree=0.793, adj=0.378, (0 split)
## temp < 0.39 to the left, agree=0.791, adj=0.372, (0 split)
## hum < 0.405 to the left, agree=0.688, adj=0.060, (0 split)
## windspeed < 0.37315 to the right, agree=0.687, adj=0.057, (0 split)
##
## Node number 13: 2520 observations, complexity param=0.02006172
## mean=279.3254, MSE=19505.74
## left son=26 (721 obs) right son=27 (1799 obs)
## Primary splits:
## temp < 0.39 to the left, improve=0.1611695, (0 missing)
## atemp < 0.4015 to the left, improve=0.1571376, (0 missing)
## mnth splits as LLRRRRRRRRRR, improve=0.1559454, (0 missing)
## season splits as LRRR, improve=0.1512278, (0 missing)
## weekday splits as RLLLLLR, improve=0.0617536, (0 missing)
## Surrogate splits:
## atemp < 0.4015 to the left, agree=0.996, adj=0.986, (0 split)
## mnth splits as LLRRRRRRRRLL, agree=0.873, adj=0.558, (0 split)
## season splits as LRRR, agree=0.803, adj=0.311, (0 split)
## hum < 0.91 to the right, agree=0.715, adj=0.004, (0 split)
##
## Node number 14: 1259 observations, complexity param=0.02181266
## mean=283.7141, MSE=22177.25
## left son=28 (524 obs) right son=29 (735 obs)
## Primary splits:
## mnth splits as LLLLRRRRRRRL, improve=0.3084984, (0 missing)
## season splits as LRRR, improve=0.3003280, (0 missing)
## temp < 0.47 to the left, improve=0.2964665, (0 missing)
## atemp < 0.4621 to the left, improve=0.2964665, (0 missing)
## workingday splits as LR, improve=0.1184052, (0 missing)
## Surrogate splits:
## temp < 0.47 to the left, agree=0.841, adj=0.618, (0 split)
## atemp < 0.4621 to the left, agree=0.841, adj=0.618, (0 split)
## season splits as LRRR, agree=0.833, adj=0.599, (0 split)
## hum < 0.385 to the left, agree=0.626, adj=0.101, (0 split)
## windspeed < 0.31345 to the right, agree=0.622, adj=0.092, (0 split)
##
## Node number 15: 1260 observations, complexity param=0.03050477
## mean=473.6944, MSE=45934.88
## left son=30 (400 obs) right son=31 (860 obs)
## Primary splits:
## workingday splits as LR, improve=0.2081288, (0 missing)
## temp < 0.43 to the left, improve=0.1949022, (0 missing)
## atemp < 0.4318 to the left, improve=0.1887979, (0 missing)
## mnth splits as LLRRRRRRRRRR, improve=0.1842011, (0 missing)
## weekday splits as LRRRRRL, improve=0.1820905, (0 missing)
## Surrogate splits:
## weekday splits as LRRRRRL, agree=0.968, adj=0.900, (0 split)
## holiday splits as RL, agree=0.714, adj=0.100, (0 split)
## atemp < 0.1894 to the left, agree=0.689, adj=0.020, (0 split)
## hum < 0.195 to the left, agree=0.686, adj=0.010, (0 split)
## temp < 0.93 to the right, agree=0.685, adj=0.007, (0 split)
##
## Node number 24: 836 observations
## mean=96.43062, MSE=3492.994
##
## Node number 25: 1680 observations
## mean=203.1929, MSE=8836.312
##
## Node number 26: 721 observations
## mean=190.7587, MSE=9708.069
##
## Node number 27: 1799 observations, complexity param=0.01945226
## mean=314.821, MSE=19028.77
## left son=54 (1240 obs) right son=55 (559 obs)
## Primary splits:
## workingday splits as RL, improve=0.13644570, (0 missing)
## weekday splits as RLLLLLR, improve=0.13176200, (0 missing)
## hr splits as -------R-RLRRRRR----RL--, improve=0.05825584, (0 missing)
## weathersit splits as RRL-, improve=0.05455847, (0 missing)
## atemp < 0.61365 to the left, improve=0.05099518, (0 missing)
## Surrogate splits:
## weekday splits as RLLLLLR, agree=0.974, adj=0.916, (0 split)
## holiday splits as LR, agree=0.715, adj=0.084, (0 split)
## atemp < 0.85605 to the left, agree=0.694, adj=0.016, (0 split)
## temp < 0.95 to the left, agree=0.692, adj=0.009, (0 split)
## hum < 0.205 to the right, agree=0.690, adj=0.004, (0 split)
##
## Node number 28: 524 observations
## mean=185.7519, MSE=11624.71
##
## Node number 29: 735 observations
## mean=353.5537, MSE=17981.19
##
## Node number 30: 400 observations, complexity param=0.01123726
## mean=330.325, MSE=32555.74
## left son=60 (164 obs) right son=61 (236 obs)
## Primary splits:
## temp < 0.49 to the left, improve=0.3407615, (0 missing)
## atemp < 0.47725 to the left, improve=0.3407615, (0 missing)
## hr splits as --------L-------RRRL----, improve=0.2724685, (0 missing)
## mnth splits as LLRRRRRRRRRL, improve=0.2545665, (0 missing)
## season splits as LRRR, improve=0.1791564, (0 missing)
## Surrogate splits:
## atemp < 0.47725 to the left, agree=1.000, adj=1.000, (0 split)
## season splits as LRRL, agree=0.878, adj=0.701, (0 split)
## mnth splits as LLRRRRRRRLLL, agree=0.878, adj=0.701, (0 split)
## hum < 0.745 to the right, agree=0.638, adj=0.116, (0 split)
## weathersit splits as RRL-, agree=0.612, adj=0.055, (0 split)
##
## Node number 31: 860 observations, complexity param=0.02638756
## mean=540.3779, MSE=38150.68
## left son=62 (344 obs) right son=63 (516 obs)
## Primary splits:
## hr splits as --------R-------LRRL----, improve=0.3175968, (0 missing)
## mnth splits as LLLRRRRRRRLL, improve=0.2233810, (0 missing)
## season splits as LRRR, improve=0.2183222, (0 missing)
## temp < 0.49 to the left, improve=0.1973805, (0 missing)
## atemp < 0.47725 to the left, improve=0.1920992, (0 missing)
## Surrogate splits:
## temp < 0.87 to the right, agree=0.602, adj=0.006, (0 split)
## atemp < 0.75 to the right, agree=0.602, adj=0.006, (0 split)
## hum < 0.215 to the left, agree=0.601, adj=0.003, (0 split)
## windspeed < 0.56715 to the right, agree=0.601, adj=0.003, (0 split)
##
## Node number 54: 1240 observations
## mean=280.6089, MSE=8820.401
##
## Node number 55: 559 observations, complexity param=0.01945226
## mean=390.712, MSE=33317.61
## left son=110 (204 obs) right son=111 (355 obs)
## Primary splits:
## hr splits as -------L-LRRRRRR----LL--, improve=0.57408930, (0 missing)
## hum < 0.575 to the right, improve=0.24107720, (0 missing)
## atemp < 0.61365 to the left, improve=0.10204650, (0 missing)
## temp < 0.55 to the left, improve=0.06584068, (0 missing)
## windspeed < 0.31345 to the left, improve=0.04022250, (0 missing)
## Surrogate splits:
## hum < 0.715 to the right, agree=0.696, adj=0.167, (0 split)
##
## Node number 60: 164 observations
## mean=203.9756, MSE=14611.74
##
## Node number 61: 236 observations
## mean=418.1271, MSE=26222.35
##
## Node number 62: 344 observations
## mean=405.564, MSE=17641.27
##
## Node number 63: 516 observations, complexity param=0.0133259
## mean=630.2539, MSE=31629.39
## left son=126 (213 obs) right son=127 (303 obs)
## Primary splits:
## mnth splits as LLLRRRRRRRLL, improve=0.3224286, (0 missing)
## temp < 0.49 to the left, improve=0.3217399, (0 missing)
## season splits as LRRR, improve=0.3196353, (0 missing)
## atemp < 0.47725 to the left, improve=0.3146083, (0 missing)
## weathersit splits as RRLL, improve=0.1475443, (0 missing)
## Surrogate splits:
## season splits as LRRL, agree=0.913, adj=0.789, (0 split)
## temp < 0.49 to the left, agree=0.888, adj=0.728, (0 split)
## atemp < 0.47725 to the left, agree=0.882, adj=0.714, (0 split)
## hum < 0.92 to the right, agree=0.597, adj=0.023, (0 split)
## windspeed < 0.43285 to the right, agree=0.597, adj=0.023, (0 split)
##
## Node number 110: 204 observations
## mean=208.2696, MSE=11069.36
##
## Node number 111: 355 observations
## mean=495.5521, MSE=15983.78
##
## Node number 126: 213 observations
## mean=509.8075, MSE=21483.6
##
## Node number 127: 303 observations
## mean=714.9241, MSE=21394.31
##
## n= 12034
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 12034 394891300 191.53130
## 2) hr=0,1,2,3,4,5,6,22,23 4479 14510300 50.22483
## 4) hr=0,1,2,3,4,5 2968 2771555 25.04549 *
## 5) hr=6,22,23 1511 6160863 99.68365 *
## 3) hr=7,8,9,10,11,12,13,14,15,16,17,18,19,20,21 7555 237925000 275.30520
## 6) hr=7,9,10,11,12,13,14,15,20,21 5036 88964490 223.56630
## 12) yr=0 2516 24127830 167.71860
## 24) mnth=1,2,3,4 836 2920143 96.43062 *
## 25) mnth=5,6,7,8,9,10,11,12 1680 14845000 203.19290 *
## 13) yr=1 2520 49154470 279.32540
## 26) temp< 0.39 721 6999518 190.75870 *
## 27) temp>=0.39 1799 34232750 314.82100
## 54) workingday=1 1240 10937300 280.60890 *
## 55) workingday=0 559 18624540 390.71200
## 110) hr=7,9,20,21 204 2258150 208.26960 *
## 111) hr=10,11,12,13,14,15 355 5674242 495.55210 *
## 7) hr=8,16,17,18,19 2519 108528400 378.74200
## 14) yr=0 1259 27921150 283.71410
## 28) mnth=1,2,3,4,12 524 6091350 185.75190 *
## 29) mnth=5,6,7,8,9,10,11 735 13216170 353.55370 *
## 15) yr=1 1260 57877950 473.69440
## 30) workingday=0 400 13022300 330.32500
## 60) temp< 0.49 164 2396326 203.97560 *
## 61) temp>=0.49 236 6188474 418.12710 *
## 31) workingday=1 860 32809580 540.37790
## 62) hr=16,19 344 6068599 405.56400 *
## 63) hr=8,17,18 516 16320770 630.25390
## 126) mnth=1,2,3,11,12 213 4576007 509.80750 *
## 127) mnth=4,5,6,7,8,9,10 303 6482477 714.92410 *
# Access significant variables
fit.rpart.cnt$variable.importance
## hr yr mnth temp atemp hum
## 209578416 38411469 27767349 27500369 27021271 20736191
## season workingday weekday windspeed holiday weathersit
## 19497245 16716978 15119647 4642626 1597331 1048780
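The whole-number percentages printed under "Variable importance" in the summary appear to be these raw scores rescaled to sum to 100 and then rounded. A minimal sketch of that rescaling, re-entering the printed values by hand (the names `vi` and `vi.pct` are ours and do not appear in the original script):

```r
# Raw importance scores copied from the output above
vi <- c(hr = 209578416, yr = 38411469, mnth = 27767349, temp = 27500369,
        atemp = 27021271, hum = 20736191, season = 19497245,
        workingday = 16716978, weekday = 15119647, windspeed = 4642626,
        holiday = 1597331, weathersit = 1048780)

# Rescale so the scores sum to 100, then round to whole percentages
vi.pct <- round(100 * vi / sum(vi))
print(vi.pct)
```

Under this rescaling, holiday and weathersit round to 0, which would explain why they are missing from the percentage table in the summary.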
# Validate the fit.rpart model using testing data
test.cnt <- test[, "cnt"]
test.x <- test[, 3:14]
rpart.cnt <- predict(fit.rpart.cnt, test.x)
test$rpart.cnt <- rpart.cnt
rpart.MSE = mean((rpart.cnt - test.cnt)^2)
rpart.MSE
## [1] 9523.749
rpart.result = ggplot(test,aes(cnt,rpart.cnt))+geom_point()
rpart.result
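An MSE of 9523.749 is hard to read on its own because it is in squared rentals. A small sketch that converts it to an RMSE, which is on the same scale as cnt and can be compared with the training mean of about 191.5 rentals per hour reported at node 1. The helper `mse()` is a hypothetical convenience function, and the three-element vectors are toy numbers, not rows from hour.csv:

```r
# Test-set MSE reported above for the single rpart model on cnt
rpart.MSE <- 9523.749

# RMSE is on the same scale as cnt (rentals per hour)
rpart.RMSE <- sqrt(rpart.MSE)
print(round(rpart.RMSE, 2))

# The same metric as a small reusable helper (hypothetical name)
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Toy check: errors of -2, 2 and -3 give (4 + 4 + 9) / 3
print(mse(c(10, 20, 30), c(12, 18, 33)))
```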
Step2: Separate models for registered and casual
# Model for Registered
formula.registered <- registered ~ season + yr + mnth + hr + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed
fit.rpart.registered <- rpart(formula.registered, method="anova", data= train)
test.registered <- test[, "registered"]
rpart.registered <- predict(fit.rpart.registered, test.x)
test$rpart.registered <- rpart.registered
rpart.registered.MSE = mean((rpart.registered - test.registered)^2)
rpart.registered.MSE
## [1] 7096.629
rpart.registered.result = ggplot(test,aes(registered,rpart.registered))+geom_point()
rpart.registered.result
# Model for Casual
formula.casual <- casual ~ season + yr + mnth + hr + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed
fit.rpart.casual <- rpart(formula.casual, method="anova", data= train)
test.casual <- test[, "casual"]
rpart.casual <- predict(fit.rpart.casual, test.x)
test$rpart.casual <- rpart.casual
rpart.casual.MSE = mean((rpart.casual - test.casual)^2)
rpart.casual.MSE
## [1] 661.3148
rpart.casual.result = ggplot(test,aes(casual,rpart.casual))+geom_point()
rpart.casual.result
Step3: Now combine the predicted registered users and casual users
rpart.combined = rpart.casual + rpart.registered
test$rpart.combined <- rpart.combined
rpart.combined.MSE = sum((test$cnt-test$rpart.combined)^2)/nrow(test)
rpart.combined.MSE
## [1] 8568.5
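The combined MSE (8568.5) is larger than the sum of the two component MSEs (7096.629 + 661.3148). Assuming cnt equals registered + casual in every row, the gap is twice the mean product of the two models' errors, so a positive gap suggests that the two trees tend to err in the same direction. The sketch below checks the identity on made-up numbers (all vectors are toy values, not taken from hour.csv) and then computes the gap implied by the reported MSEs:

```r
# Toy actuals and predictions (made-up values, not from hour.csv)
registered <- c(100, 40, 250, 10)
casual     <- c(20, 5, 60, 2)
pred.reg   <- c(90, 50, 230, 15)
pred.cas   <- c(15, 8, 50, 4)

e.reg <- pred.reg - registered   # errors of the registered model
e.cas <- pred.cas - casual       # errors of the casual model

mse.reg      <- mean(e.reg^2)
mse.cas      <- mean(e.cas^2)
cross        <- 2 * mean(e.reg * e.cas)
mse.combined <- mean(((pred.reg + pred.cas) - (registered + casual))^2)

# Identity: combined MSE = registered MSE + casual MSE + cross term
print(all.equal(mse.combined, mse.reg + mse.cas + cross))

# Cross term implied by the three MSEs reported in this section
implied.cross <- 8568.5 - (7096.629 + 661.3148)
print(implied.cross)
```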
Now let’s compare the results of all the models in a comparison table.
rpart.MSEs <- c(rpart.MSE, rpart.combined.MSE, rpart.registered.MSE, rpart.casual.MSE )
rpart.MSEs <- matrix(rpart.MSEs, nrow= 1, ncol=4)
colnames(rpart.MSEs) <-c("cnt.MSE", "combined.MSE", "registered.MSE", "casual.MSE" )
rownames(rpart.MSEs) <- "rpart.MSEs"
lm.MSEs = c(lm.MSE, lm.combined.MSE, lm.registered.MSE, lm.casual.MSE)
forest.MSEs = c(forest.MSE, forest.combined.MSE, forest.registered.MSE, forest.casual.MSE )
svm.MSEs = c(svm.MSE, svm.combined.MSE, svm.registered.MSE, svm.casual.MSE)
neural.MSEs = c(neural.MSE, neural.combined.MSE, neural.registered.MSE, neural.casual.MSE)
gbm.MSEs = c(gbm.MSE, gbm.combined.MSE, gbm.registered.MSE, gbm.casual.MSE)
Summary.MSE=rbind(rpart.MSEs, forest.MSEs, svm.MSEs, neural.MSEs, gbm.MSEs, lm.MSEs)
Summary.MSE
## cnt.MSE combined.MSE registered.MSE casual.MSE
## rpart.MSEs 9523.749 8568.500 7096.629 661.3148
## forest.MSEs 3798.602 3515.571 2619.906 379.7289
## svm.MSEs 7511.831 7655.972 5854.001 647.9083
## neural.MSEs 4518.081 7194.954 3498.999 2302.9344
## gbm.MSEs 3556.995 3369.337 2636.727 343.1621
## lm.MSEs 10738.163 10684.486 7763.378 913.7805
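Reading the winner of each column off the table by eye is error-prone, so here is a small sketch that finds it programmatically. The matrix is re-entered by hand from the printed output so that the snippet runs on its own; in the original session `Summary.MSE` already exists and only the last three lines would be needed. The name `best` is ours:

```r
# Values copied from the printed comparison table above
Summary.MSE <- rbind(
  rpart.MSEs  = c(9523.749, 8568.500, 7096.629, 661.3148),
  forest.MSEs = c(3798.602, 3515.571, 2619.906, 379.7289),
  svm.MSEs    = c(7511.831, 7655.972, 5854.001, 647.9083),
  neural.MSEs = c(4518.081, 7194.954, 3498.999, 2302.9344),
  gbm.MSEs    = c(3556.995, 3369.337, 2636.727, 343.1621),
  lm.MSEs     = c(10738.163, 10684.486, 7763.378, 913.7805)
)
colnames(Summary.MSE) <- c("cnt.MSE", "combined.MSE",
                           "registered.MSE", "casual.MSE")

# For every column, find the row (model) with the smallest MSE
best <- rownames(Summary.MSE)[apply(Summary.MSE, 2, which.min)]
names(best) <- colnames(Summary.MSE)
print(best)
```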
As shown above, for registered users the random forest gives the best prediction (lowest MSE), while for casual users GBM gives the best result. Note that the predictions from the random forest and GBM contain unwanted negative values, so we manually convert them to 0.
This transformation matters because the final output we want is the total number of rentals, derived from the predicted registered users (by random forest) and the predicted casual users (by GBM). Since each model’s output includes negative values, we applied the transformation before adding the predictions. With this additional step we obtained the lowest MSE, so this combination is our preferred model.
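One way to do this conversion is base R's `pmax()`, which takes the element-wise maximum of a vector and 0. The sketch below uses short made-up prediction vectors (`pred.registered` and `pred.casual` are hypothetical names and values, not the actual model outputs) and clamps each before adding them:

```r
# Made-up predictions containing a few negative values
pred.registered <- c(120.4, -3.2, 15.0)
pred.casual     <- c(10.1, 2.5, -1.7)

# Clamp each model's output at 0 first, then add the two components
pred.cnt <- pmax(pred.registered, 0) + pmax(pred.casual, 0)
print(pred.cnt)
```

Clamping before adding matters: in the third element, adding first would let the negative casual prediction pull the total below the registered prediction.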
Having obtained the relative influence (from gbm) for “cnt”, “registered”, and “casual”, we attempted to visualize the story behind the numbers.
Important Variables (Relative Influence)
cnt: hr, yr, workingday, temp, mnth, weekday, ...
registered: hr, workingday, yr, mnth, weekday, ...
casual: hr, weekday, temp, workingday, ...
Because these factors are similar, we subjectively grouped them into 4 factor groups and plotted each against “cnt” (total users), “registered” (registered users), and “casual” (casual users).
Factor groups:
(1) hr
(2) yr, mnth
(3) workingday, weekday, holiday
(4) temp, atemp, hum
Given the number of attributes, we could not plot every pair. Instead, we show the most significant relationships between the x’s (e.g. hr, yr) and the y’s (e.g. cnt, registered, casual).
# Ratio of 2 types of users
r.registered <- sum(hour.data$registered) / sum(hour.data$cnt)
r.casual <- (1- r.registered)
print(c("Registered %",r.registered ))
## [1] "Registered %" "0.811698316173547"
print(c("Casual %", r.casual))
## [1] "Casual %" "0.188301683826453"
Registered users account for the majority of rentals. Because “hr” is the most significant factor, let’s start with a plot of “cnt” against “hr”.
plot(x= hour.data$hr, y= hour.data$cnt)
This plot suggests an interesting pattern: peak periods. cnt stands out during 7-9am and 5-7pm, which coincides with the times when people travel to work in the morning and leave work in the late afternoon. Let’s look at finer-grained plots.
Since hour is the most significant factor for both registered and casual users, let’s see how the cnt, registered and casual fluctuate with the hour.
Because such rush-hour patterns exist, we divided the day into 5 segments to better visualize and understand how the number of rentals changes with the time of day for both registered and casual users.
# Create daypart column, default to "Night"
hour.data$daypart <- "Night"
# 0am -7am: "Early morning"
hour.data$daypart[(hour.data$hr >=0 ) & (hour.data$hr <7 )] <- "Early Morning"
# 7am- 9am : "Peak Morning"
hour.data$daypart[(hour.data$hr >=7 ) & (hour.data$hr <9 )] <- "Peak Morning"
# 9am- 5pm : "Day"
hour.data$daypart[(hour.data$hr >=9 ) & (hour.data$hr <17 )] <- "Day"
# 5pm- 8pm : "Peak Evening"
hour.data$daypart[(hour.data$hr >=17 ) & (hour.data$hr <20 )] <- "Peak Evening"
# Factorization
hour.data$daypart <- factor(hour.data$daypart)
# Count by hour
g.cnt.hr <- ggplot(hour.data, aes(x = hr, y = cnt))
g.cnt.hr + geom_point(aes(color = daypart)) + ggtitle("Total Rental by Hour")
The plot shows two peaks, during the morning and evening rush hours, from 7am to 9am and from 5pm to 8pm. Let’s break it down into registered and casual users.
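The five-way segmentation of the day can also be expressed with base R's `cut()`, which maps a numeric vector onto labelled intervals in one call. This is an alternative sketch, not the code used for the plot; it runs on a stand-in vector `hr` of the 24 hours rather than on `hour.data`:

```r
hr <- 0:23   # one value per hour of the day (stand-in for hour.data$hr)

daypart <- cut(hr,
               breaks = c(0, 7, 9, 17, 20, 24),
               labels = c("Early Morning", "Peak Morning", "Day",
                          "Peak Evening", "Night"),
               right = FALSE)   # intervals are [lower, upper)

# Number of hours that fall into each segment
print(table(daypart))
```

A side effect of this design is that the factor levels come out in chronological order, whereas `factor()` on a character vector sorts them alphabetically, which changes the legend order in ggplot2.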
# Registered by hour
g.registered.hr <- ggplot(hour.data, aes(x = hr, y = registered))
g.registered.hr + geom_point(aes(color = daypart)) + ggtitle("Registered Rental by Hour")
The peak hours are even more pronounced for registered users. Apparently, many registered users commute to work by rental bike.
# Casual by hour
g.casual.hr <- ggplot(hour.data, aes(x = hr, y = casual))
g.casual.hr + geom_point(aes(color = daypart)) + ggtitle("Casual Rental by Hour")
Peak hours have little impact on casual users, which implies that the people who commute by rental bike are mostly registered users. Also, many casual users ride in the afternoon, which may correlate with temperature or other weather factors (it gets warmer from about 11am). We examine this hypothesis, the relationship between temperature and casual users, later.
# Monthly total rental fluctuation in two years
# Map the yr code (0 or 1) to the calendar year
year <- function(x) {
  if (x == 0) 2011 else 2012
}
hour.data$year <- factor(sapply(hour.data$yr, year))
g.cnt.mnth <- ggplot(hour.data, aes(as.numeric(mnth), as.numeric(cnt), colour = as.factor(year)))
g.cnt.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Monthly Total Rental Over Two Years")
Ridership increased significantly in 2012. Furthermore, since 81% of rentals come from registered users, we assumed that the ridership of registered users went up sharply. Let’s evaluate this assumption as follows.
# Monthly registered rental fluctuation in two years
g.registered.mnth <- ggplot(hour.data, aes(as.numeric(mnth), as.numeric(registered), colour = as.factor(year)))
g.registered.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Monthly Registered Rental Over Two Years")
Not only did the ridership of registered users increase significantly, there is also an interesting pattern: while it increased steadily over the first 7 months, there appears to be a jump from August 2011. This might result from new policies or other environmental factors.
Note that usage is generally lower in winter, which may be related to lower temperatures.
# Monthly casual rental fluctuation in two years
g.casual.mnth <- ggplot(hour.data, aes(as.numeric(mnth), as.numeric(casual), colour = as.factor(year)))
g.casual.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Monthly Casual Rental Over Two Years")
There are far more casual users in summer and fall than in spring and winter. This is consistent with casual users being more affected by environmental settings than registered users.
Recalling the previous graph, the usage curve of registered users is flatter than that of casual users, because registered users who commute by bike ride regularly and are relatively insensitive to the month.
# Count by hour on workingday
g.cnt.workday <- ggplot(hour.data, aes(x = hr, y = cnt, fill = as.factor(workingday)))
g.cnt.workday + geom_bar(stat = "identity", position="dodge") + ggtitle("Total Rental by Workingday")
1 denotes a working day and 0 a non-working day. On working days the peak hours are very obvious, while on non-working days many people use the rental bikes in the afternoon. One interpretation is that on a non-working day people use the service for leisure.
# Registered on workingday
g.registered.workday <- ggplot(hour.data, aes(x = hr, y = registered, fill = as.factor(workingday)))
g.registered.workday + geom_bar(stat = "identity", position="dodge") + ggtitle("Registered Rental by Workingday")
On working days, registered usage is concentrated in the peak periods, which again suggests that most registered users are bike commuters. On non-working days, their usage fluctuates less. Let’s evaluate:
library(data.table)
##
## Attaching package: 'data.table'
## The following object is masked _by_ '.GlobalEnv':
##
## year
# Subsetting
sub.hour.data <- hour.data[,c("daypart", "cnt", "registered", "casual")]
# Create data table
dt <-as.data.table(sub.hour.data)
# Extract rows where daypart is "Peak Morning" or "Peak Evening"
dt.peak <- dt[daypart %in% c("Peak Morning", "Peak Evening")]
# 42.3% of registered rentals occur in peak hours
dt.peak[, sum(registered)] / dt[, sum(registered)]
## [1] 0.4230142
Note that 81% of rentals come from registered users, and 42.3% of those occur in peak hours, indicating that management needs to pay special attention to bike allocation during peak hours.
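Multiplying the two shares gives a rough sense of scale: what fraction of all rentals are registered rentals made during peak hours. The sketch below only reuses the two ratios printed earlier; the variable names are ours, and the five peak hours (7, 8, 17, 18, 19) follow from the daypart definition used in the code:

```r
share.registered <- 0.811698316173547   # registered / cnt, printed earlier
share.peak       <- 0.4230142           # peak-hour share of registered rentals

# Registered peak-hour rentals as a fraction of all rentals
peak.of.total <- share.registered * share.peak
print(round(peak.of.total, 3))

# Fraction of the day covered by the five peak hours
hours.share <- 5 / 24
print(round(hours.share, 3))
```

So roughly a third of all rentals fall into about a fifth of the day, which is the concentration the allocation advice rests on.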
# Casual on workingday
g.casual.workday <- ggplot(hour.data, aes(x = hr, y = casual, fill = as.factor(workingday)))
g.casual.workday + geom_bar(stat = "identity", position="dodge") + ggtitle("Casual Rental by Workingday")
Most rentals (81%) come from registered users; the remaining 19% come from casual users. Unlike registered users, casual users show a similar hourly pattern on working and non-working days. Notably, casual users tend to use the service on non-working days, especially in the daytime, when people are most active. Alternatively, they may be unable to get enough bikes on working days because of their lower priority.
# Count on weekday
g.cnt.hr.byweekday <- ggplot(hour.data, aes(as.numeric(hr), as.numeric(cnt), colour = as.factor(weekday)))
g.cnt.hr.byweekday + geom_smooth(se = FALSE, method = "auto") + ggtitle("Total Rentals vs Hour (from Sunday to Saturday)")
Rentals from Monday to Friday follow one pattern, while rentals on Saturday and Sunday follow another. This matches the previous graph and suggests that management has to treat the allocation of bikes on weekdays and weekends very differently.
Let’s evaluate our assumption:
# Check the proportion of registered & casual users on a working or non-working day
sub.hour.data <- hour.data[,c("workingday", "cnt", "registered", "casual")]
dt <-as.data.table(sub.hour.data)
dt.wd1 <- dt[workingday == 1]
dt.wd0 <- dt[workingday == 0]
# On a working day, 87% of users are registered
dt.wd1[, sum(registered)] /dt.wd1[, sum(cnt)]
## [1] 0.8677004
# While on a non-working day, casual users account for 32% of total ridership
dt.wd0[, sum(casual)] /dt.wd0[, sum(cnt)]
## [1] 0.3166468
The business insight here is that, on a working day, management should focus on providing the best service for registered users.
Conversely, on a non-working day, although registered users still outnumber casual users, demand from casual users becomes important, especially from 12pm to 5pm. The proportion of casual users increases from 13% on a working day to 32% on a non-working day.
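The 13% figure is not printed directly; it is the complement of the 87% registered share on working days. A short sketch of that arithmetic, using only the two ratios printed above (the variable names are ours):

```r
casual.workingday    <- 1 - 0.8677004   # complement of the registered share
casual.nonworkingday <- 0.3166468       # printed above

# Casual share of ridership on working days
print(round(casual.workingday, 3))

# How many times larger the casual share is on non-working days
casual.ratio <- casual.nonworkingday / casual.workingday
print(round(casual.ratio, 2))
```

In other words, the casual share of ridership is a bit more than twice as large on non-working days.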
The following plot shows how temperature changes over a one-year time frame.
# mnth vs temp
g.temp.mnth <- ggplot(hour.data, aes(as.numeric(mnth), temp))
g.temp.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Temperature fluctuation in a Year")
As discussed above, we assumed that registered users are less affected by environmental settings. Let’s see:
# Registered on temp
g.registered.temp <- ggplot(hour.data, aes(x = temp, y = registered))
g.registered.temp + geom_point()
Unsurprisingly, the number of rentals by registered users does not change much with temperature, which supports our assumption.
# Casual on temp
g.casual.temp <- ggplot(hour.data, aes(x = temp, y = casual))
g.casual.temp + geom_point()
Casual users are more sensitive to temperature than registered users. Usage is much higher between 20 and 30 degrees Celsius. Interestingly, ridership of casual users drops sharply when the temperature exceeds 30 degrees, perhaps because they find the heat unbearable.
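Because the plots use the normalized temp column, the 20 and 30 degree thresholds have to be translated onto a 0 to 1 axis. The sketch below assumes the simple scaling given in the dataset description (temperature in Celsius divided by 41); if the underlying normalization also involved an offset, the mapping would differ. The helper names `to.celsius` and `to.normalized` are hypothetical:

```r
temp.max <- 41   # divisor stated in the dataset description

to.celsius    <- function(temp.norm) temp.norm * temp.max
to.normalized <- function(celsius) celsius / temp.max

# Where 20 and 30 degrees Celsius fall on the normalized axis
print(round(to.normalized(c(20, 30)), 3))

# A normalized reading of 0.49 (a split point in the tree above) in Celsius
print(to.celsius(0.49))
```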
# Registered/casual on feels-like temp
hour.data$raw.atemp <- hour.data$atemp*50
g.temp2 <- ggplot(hour.data, aes(x = raw.atemp, y = registered))
g.temp2 + geom_point()
g.temp2 <- ggplot(hour.data, aes(x = raw.atemp, y = casual))
g.temp2 + geom_point()
The atemp plot is similar to the temp plot.
# Registered/casual on humidity
g.registered.hum <- ggplot(hour.data, aes(x = hum, y = registered))
g.registered.hum + geom_point()
g.casual.hum <- ggplot(hour.data, aes(x = hum, y = casual))
g.casual.hum + geom_point()
Casual users are more sensitive to humidity than registered users, but casual usage is also fairly smooth except at extreme humidity (e.g. heavy rain). This indicates that biking activity is relatively insensitive to humidity.
The plots confirm our assumption that hr, mnth, workingday, temp and hum are the factors most strongly related to cnt/registered/casual. We also have the following interesting findings: 1) Most registered users commute to work by rental bike, while casual users do not. 2) 2012 showed an increase in users over 2011, driven mainly by registered users. 3) The hourly usage pattern differs a lot between working days and non-working days. 4) Casual users are more sensitive to weather conditions than registered users.
The bike-sharing system should allocate bikes with these facts in mind.
[1] Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.
@article{ year={2013}, issn={2192-6352}, journal={Progress in Artificial Intelligence}, doi={10.1007/s13748-013-0040-3}, title={Event labeling combining ensemble detectors and background knowledge}, url={http://dx.doi.org/10.1007/s13748-013-0040-3}, publisher={Springer Berlin Heidelberg}, keywords={Event labeling; Event detection; Ensemble learning; Background knowledge}, author={Fanaee-T, Hadi and Gama, Joao}, }
[2] https://rpubs.com/saitej09/bikesharing